Exploratory analysis and linear regression analysis of abnormal price changes and surprise percentages over different day windows

In [38]:
import pandas as pd
In [39]:
df_price_changes = pd.read_csv('Adjusted return Drifts.csv')
In [40]:
df_price_changes
Out[40]:
APPL - 3 Day Drift Change - abnormal APPL - 5 Day Drift Change - abnormal APPL - 10 Day Drift Change - abnormal surprise surprisePercentage
0 -2.53% -5.17% -2.50% 0.0550 73.3333
1 3.54% 2.93% 2.47% 0.0325 37.1429
2 -0.85% -0.27% -1.73% 0.0150 13.6364
3 -1.22% -1.90% -3.30% 0.0200 13.7931
4 -3.69% 0.03% -0.18% 0.0375 19.4805
... ... ... ... ... ...
58 -3.75% -2.47% -1.35% 0.0600 4.4776
59 -0.70% -2.21% -1.35% 0.0200 2.1053
60 -1.31% -1.89% 1.22% 0.0600 2.5641
61 -1.93% -3.43% -0.94% 0.0300 1.8519
62 -0.71% 7.09% 11.46% 0.1400 9.7902

63 rows × 5 columns

In [41]:
df_price_changes['APPL - 3 Day Drift Change - abnormal'] = (
    df_price_changes['APPL - 3 Day Drift Change - abnormal']
    .astype(str)
    .str.replace('%', '', regex=False)
    .astype(float)
)

df_price_changes['APPL - 5 Day Drift Change - abnormal'] = (
    df_price_changes['APPL - 5 Day Drift Change - abnormal']
    .astype(str)
    .str.replace('%', '', regex=False)
    .astype(float)
)

df_price_changes['APPL - 10 Day Drift Change - abnormal'] = (
    df_price_changes['APPL - 10 Day Drift Change - abnormal']
    .astype(str)
    .str.replace('%', '', regex=False)
    .astype(float)
)
In [42]:
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import numpy as np

# Define X and y values
X = df_price_changes[['surprisePercentage']]   # must be 2D for sklearn
y = df_price_changes['APPL - 3 Day Drift Change - abnormal']

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Extract parameters
slope = model.coef_[0]
intercept = model.intercept_
r_squared = model.score(X, y)

print(f"Slope: {slope:.4f}")
print(f"Intercept: {intercept:.4f}")
print(f"R-squared: {r_squared:.4f}")

# Make predictions
y_pred = model.predict(X)

# Plot visualisation
plt.figure(figsize=(8,5))
plt.scatter(X, y, color='blue', label='Actual data')
plt.plot(X, y_pred, color='red', linewidth=2, label='Regression line')
plt.title('Linear Regression: Abnormal Returns vs Surprise Percentage')
plt.xlabel('Surprise Percentage')
plt.ylabel('APPL - 3 Day Drift Change - abnormal')
plt.legend()
plt.show()
Slope: 0.0104
Intercept: -0.2896
R-squared: 0.0048
No description has been provided for this image
In [43]:
# Assuming df_price_changes is already loaded
# Define X and y
X = df_price_changes[['surprisePercentage']]   # must be 2D for sklearn
y = df_price_changes['APPL - 5 Day Drift Change - abnormal']

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Extract parameters
slope = model.coef_[0]
intercept = model.intercept_
r_squared = model.score(X, y)

print(f"Slope: {slope:.4f}")
print(f"Intercept: {intercept:.4f}")
print(f"R-squared: {r_squared:.4f}")

# Make predictions
y_pred = model.predict(X)

# Plot
plt.figure(figsize=(8,5))
plt.scatter(X, y, color='blue', label='Actual data')
plt.plot(X, y_pred, color='red', linewidth=2, label='Regression line')
plt.title('Linear Regression: Abnormal Returns vs Surprise Percentage')
plt.xlabel('Surprise Percentage')
plt.ylabel('APPL - 5 Day Drift Change - abnormal')
plt.legend()
plt.show()
Slope: -0.0219
Intercept: 0.5670
R-squared: 0.0098
No description has been provided for this image
In [44]:
# Assuming df_price_changes is already loaded
# Define X and y
X = df_price_changes[['surprisePercentage']]   # must be 2D for sklearn
y = df_price_changes['APPL - 10 Day Drift Change - abnormal']

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Extract parameters
slope = model.coef_[0]
intercept = model.intercept_
r_squared = model.score(X, y)

print(f"Slope: {slope:.4f}")
print(f"Intercept: {intercept:.4f}")
print(f"R-squared: {r_squared:.4f}")

# Make predictions
y_pred = model.predict(X)

# Plot
plt.figure(figsize=(8,5))
plt.scatter(X, y, color='blue', label='Actual data')
plt.plot(X, y_pred, color='red', linewidth=2, label='Regression line')
plt.title('Linear Regression: Abnormal Returns vs Surprise Percentage')
plt.xlabel('Surprise Percentage')
plt.ylabel('APPL - 10 Day Drift Change - abnormal')
plt.legend()
plt.show()
Slope: -0.0072
Intercept: 0.6145
R-squared: 0.0005
No description has been provided for this image

Multilinear regressions across separate day windows - Exploratory models using manually cleaned spreadsheet data

In [91]:
df_price_changes_multilinear = pd.read_csv('Abnormal Returns - Multi Linear regressions.csv')
In [92]:
df_price_changes_multilinear['APPL - 3 Day Drift Change - abnormal'] = (
    df_price_changes_multilinear['APPL - 3 Day Drift Change - abnormal']
    .astype(str)
    .str.replace('%', '', regex=False)
    .astype(float)
)

df_price_changes_multilinear['APPL - 5 Day Drift Change - abnormal'] = (
    df_price_changes_multilinear['APPL - 5 Day Drift Change - abnormal']
    .astype(str)
    .str.replace('%', '', regex=False)
    .astype(float)
)
df_price_changes_multilinear['APPL - 10 Day Drift Change - abnormal'] = (
    df_price_changes_multilinear['APPL - 10 Day Drift Change - abnormal']
    .astype(str)
    .str.replace('%', '', regex=False)
    .astype(float)
)

df_price_changes_multilinear['APPL - 3 Day before announcement change - abnormal'] = (
    df_price_changes_multilinear['APPL - 3 Day before announcement change - abnormal']
    .astype(str)
    .str.replace('%', '', regex=False)
    .astype(float)
)

df_price_changes_multilinear['APPL - 10 Day before announcement change - abnormal'] = (
    df_price_changes_multilinear['APPL - 10 Day before announcement change - abnormal']
    .astype(str)
    .str.replace('%', '', regex=False)
    .astype(float)
)

df_price_changes_multilinear['APPL - 20 Day Drift Change - abnormal'] = (
    df_price_changes_multilinear['APPL - 20 Day Drift Change - abnormal']
    .astype(str)
    .str.replace('%', '', regex=False)
    .astype(float)
)
In [93]:
df_price_changes_multilinear
Out[93]:
Date First day of month Adj Close Close High Low Open Volume APPL - daily change APPL - daily change - abnormal ... CPI fiscalDateEnding reportedDate reportedEPS estimatedEPS surprise surprisePercentage reportTime symbol totalRevenue
0 25/01/2010 01/01/2010 6.096186 7.252500 7.310714 7.149643 7.232500 1065699600 2.69% 2.23% ... 216.687 31/12/2009 25/01/2010 0.130 0.0750 0.0550 73.3333 post-market AAPL 1.568300e+10
1 20/04/2010 01/04/2010 7.342619 8.735357 8.901786 8.677143 8.876429 738326400 -1.00% -1.81% ... 218.009 31/03/2010 20/04/2010 0.120 0.0875 0.0325 37.1429 post-market AAPL 1.349900e+10
2 20/07/2010 01/07/2010 7.561765 8.996071 9.032143 8.571786 8.675000 1074950800 2.57% 1.43% ... 218.011 30/06/2010 20/07/2010 0.125 0.1100 0.0150 13.6364 post-market AAPL 1.570000e+10
3 18/10/2010 01/10/2010 9.546394 11.357143 11.392857 11.224643 11.373929 1093010800 1.04% 0.31% ... 218.711 30/09/2010 18/10/2010 0.165 0.1450 0.0200 13.7931 post-market AAPL 2.034300e+10
4 18/01/2011 01/01/2011 10.226353 12.166071 12.312857 11.642857 11.768571 1880998000 -2.25% -2.38% ... 220.223 31/12/2010 18/01/2011 0.230 0.1925 0.0375 19.4805 post-market AAPL 2.674100e+10
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
58 01/08/2024 01/08/2024 217.097168 218.360001 224.479996 217.020004 224.369995 62501000 -1.68% -0.31% ... 314.796 30/06/2024 01/08/2024 1.400 1.3400 0.0600 4.4776 post-market AAPL 8.577700e+10
59 31/10/2024 01/10/2024 224.863480 225.910004 229.830002 225.369995 229.339996 64370100 -1.82% 0.04% ... 315.664 30/09/2024 31/10/2024 0.970 0.9500 0.0200 2.1053 post-market AAPL 9.493000e+10
60 30/01/2025 01/01/2025 236.749542 237.589996 240.789993 237.210007 238.669998 55658300 -0.74% -1.27% ... 317.671 31/12/2024 30/01/2025 2.400 2.3400 0.0600 2.5641 post-market AAPL 1.240000e+11
61 01/05/2025 01/05/2025 212.799133 213.320007 214.559998 208.899994 209.080002 57365700 0.39% -0.24% ... 321.465 31/03/2025 01/05/2025 1.650 1.6200 0.0300 1.8519 post-market AAPL 9.535900e+10
62 31/07/2025 01/07/2025 207.334701 207.570007 209.839996 207.160004 208.490005 80698400 -0.71% -0.34% ... 323.048 30/06/2025 31/07/2025 1.570 1.4300 0.1400 9.7902 post-market AAPL 9.403600e+10

63 rows × 39 columns

In [94]:
pip install pandas statsmodels
Requirement already satisfied: pandas in c:\users\aledr\anaconda3\lib\site-packages (2.2.3)Note: you may need to restart the kernel to use updated packages.

Requirement already satisfied: statsmodels in c:\users\aledr\anaconda3\lib\site-packages (0.14.4)
Requirement already satisfied: numpy>=1.26.0 in c:\users\aledr\anaconda3\lib\site-packages (from pandas) (2.1.3)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\aledr\anaconda3\lib\site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in c:\users\aledr\anaconda3\lib\site-packages (from pandas) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in c:\users\aledr\anaconda3\lib\site-packages (from pandas) (2025.2)
Requirement already satisfied: scipy!=1.9.2,>=1.8 in c:\users\aledr\anaconda3\lib\site-packages (from statsmodels) (1.15.3)
Requirement already satisfied: patsy>=0.5.6 in c:\users\aledr\anaconda3\lib\site-packages (from statsmodels) (1.0.1)
Requirement already satisfied: packaging>=21.3 in c:\users\aledr\anaconda3\lib\site-packages (from statsmodels) (24.2)
Requirement already satisfied: six>=1.5 in c:\users\aledr\anaconda3\lib\site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)
In [96]:
# Define the independent variables (same for all three)
X = df_price_changes_multilinear[['surprisePercentage',
                                  'Vix - Close - Pre Day',
                                  'Fed Funds Rate',
                                  'APPL - 10 Day before announcement change - abnormal']]

# Add a constant (intercept)
X = sm.add_constant(X)

# ----------------------------------------------------------
# Model 1 — 3-Day Drift
# ----------------------------------------------------------
y1 = df_price_changes_multilinear['APPL - 3 Day Drift Change - abnormal']
model1 = sm.OLS(y1, X).fit()
print("=== Model 1: 3-Day Drift ===")
print(model1.summary())
print("\n")

# ----------------------------------------------------------
# Model 2 — 5-Day Drift
# ----------------------------------------------------------
y2 = df_price_changes_multilinear['APPL - 5 Day Drift Change - abnormal']
model2 = sm.OLS(y2, X).fit()
print("=== Model 2: 5-Day Drift ===")
print(model2.summary())
print("\n")

# ----------------------------------------------------------
# Model 3 — 10-Day Drift
# ----------------------------------------------------------
y3 = df_price_changes_multilinear['APPL - 10 Day Drift Change - abnormal']
model3 = sm.OLS(y3, X).fit()
print("=== Model 3: 10-Day Drift ===")
print(model3.summary())

# ----------------------------------------------------------
# Model 4 — 20-Day Drift
# ----------------------------------------------------------
y3 = df_price_changes_multilinear['APPL - 20 Day Drift Change - abnormal']
model3 = sm.OLS(y3, X).fit()
print("=== Model 4: 20-Day Drift ===")
print(model3.summary())
=== Model 1: 3-Day Drift ===
                                     OLS Regression Results                                     
================================================================================================
Dep. Variable:     APPL - 3 Day Drift Change - abnormal   R-squared:                       0.113
Model:                                              OLS   Adj. R-squared:                  0.052
Method:                                   Least Squares   F-statistic:                     1.843
Date:                                  Fri, 31 Oct 2025   Prob (F-statistic):              0.133
Time:                                          15:58:24   Log-Likelihood:                -128.73
No. Observations:                                    63   AIC:                             267.5
Df Residuals:                                        58   BIC:                             278.2
Df Model:                                             4                                         
Covariance Type:                              nonrobust                                         
=======================================================================================================================
                                                          coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------------------
const                                                   1.4165      0.808      1.753      0.085      -0.201       3.034
surprisePercentage                                      0.0105      0.019      0.538      0.593      -0.028       0.049
Vix - Close - Pre Day                                  -0.0802      0.040     -2.019      0.048      -0.160      -0.001
Fed Funds Rate                                         -0.1535      0.142     -1.080      0.285      -0.438       0.131
APPL - 10 Day before announcement change - abnormal    -0.0757      0.070     -1.086      0.282      -0.215       0.064
==============================================================================
Omnibus:                        0.527   Durbin-Watson:                   2.084
Prob(Omnibus):                  0.768   Jarque-Bera (JB):                0.673
Skew:                           0.137   Prob(JB):                        0.714
Kurtosis:                       2.574   Cond. No.                         74.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


=== Model 2: 5-Day Drift ===
                                     OLS Regression Results                                     
================================================================================================
Dep. Variable:     APPL - 5 Day Drift Change - abnormal   R-squared:                       0.084
Model:                                              OLS   Adj. R-squared:                  0.021
Method:                                   Least Squares   F-statistic:                     1.329
Date:                                  Fri, 31 Oct 2025   Prob (F-statistic):              0.270
Time:                                          15:58:24   Log-Likelihood:                -154.51
No. Observations:                                    63   AIC:                             319.0
Df Residuals:                                        58   BIC:                             329.7
Df Model:                                             4                                         
Covariance Type:                              nonrobust                                         
=======================================================================================================================
                                                          coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------------------
const                                                   2.5718      1.217      2.114      0.039       0.136       5.007
surprisePercentage                                     -0.0289      0.029     -0.987      0.328      -0.088       0.030
Vix - Close - Pre Day                                  -0.0799      0.060     -1.337      0.186      -0.200       0.040
Fed Funds Rate                                         -0.3304      0.214     -1.544      0.128      -0.759       0.098
APPL - 10 Day before announcement change - abnormal    -0.0651      0.105     -0.620      0.538      -0.275       0.145
==============================================================================
Omnibus:                        0.663   Durbin-Watson:                   2.115
Prob(Omnibus):                  0.718   Jarque-Bera (JB):                0.279
Skew:                           0.147   Prob(JB):                        0.870
Kurtosis:                       3.140   Cond. No.                         74.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


=== Model 3: 10-Day Drift ===
                                      OLS Regression Results                                     
=================================================================================================
Dep. Variable:     APPL - 10 Day Drift Change - abnormal   R-squared:                       0.022
Model:                                               OLS   Adj. R-squared:                 -0.045
Method:                                    Least Squares   F-statistic:                    0.3285
Date:                                   Fri, 31 Oct 2025   Prob (F-statistic):              0.858
Time:                                           15:58:24   Log-Likelihood:                -179.62
No. Observations:                                     63   AIC:                             369.2
Df Residuals:                                         58   BIC:                             379.9
Df Model:                                              4                                         
Covariance Type:                               nonrobust                                         
=======================================================================================================================
                                                          coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------------------
const                                                   1.9140      1.813      1.056      0.295      -1.714       5.542
surprisePercentage                                     -0.0067      0.044     -0.153      0.879      -0.094       0.081
Vix - Close - Pre Day                                  -0.0602      0.089     -0.676      0.502      -0.239       0.118
Fed Funds Rate                                         -0.1161      0.319     -0.364      0.717      -0.754       0.522
APPL - 10 Day before announcement change - abnormal    -0.1146      0.156     -0.733      0.467      -0.428       0.198
==============================================================================
Omnibus:                        0.105   Durbin-Watson:                   2.047
Prob(Omnibus):                  0.949   Jarque-Bera (JB):                0.061
Skew:                           0.064   Prob(JB):                        0.970
Kurtosis:                       2.918   Cond. No.                         74.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
=== Model 4: 20-Day Drift ===
                                      OLS Regression Results                                     
=================================================================================================
Dep. Variable:     APPL - 20 Day Drift Change - abnormal   R-squared:                       0.138
Model:                                               OLS   Adj. R-squared:                  0.079
Method:                                    Least Squares   F-statistic:                     2.327
Date:                                   Fri, 31 Oct 2025   Prob (F-statistic):             0.0668
Time:                                           15:58:24   Log-Likelihood:                -198.11
No. Observations:                                     63   AIC:                             406.2
Df Residuals:                                         58   BIC:                             416.9
Df Model:                                              4                                         
Covariance Type:                               nonrobust                                         
=======================================================================================================================
                                                          coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------------------------------------
const                                                   5.8825      2.431      2.420      0.019       1.017      10.748
surprisePercentage                                      0.0164      0.059      0.281      0.780      -0.101       0.134
Vix - Close - Pre Day                                  -0.2501      0.119     -2.094      0.041      -0.489      -0.011
Fed Funds Rate                                         -0.2891      0.428     -0.676      0.502      -1.145       0.567
APPL - 10 Day before announcement change - abnormal    -0.3757      0.210     -1.791      0.078      -0.795       0.044
==============================================================================
Omnibus:                        0.257   Durbin-Watson:                   2.402
Prob(Omnibus):                  0.879   Jarque-Bera (JB):                0.069
Skew:                          -0.080   Prob(JB):                        0.966
Kurtosis:                       3.019   Cond. No.                         74.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Exploratory analysis of price changes, earnings surprises and 3 day drift change

In [3]:
import pandas as pd
In [3]:
dataset = pd.read_excel('Copy of Apple_Master_Sheet_1.xlsx')
In [4]:
dataset
Out[4]:
Unnamed: 0 Date First day of month Adj Close Close High Low Open Volume APPL - daily change ... CPI fiscalDateEnding reportedDate reportedEPS estimatedEPS surprise surprisePercentage reportTime symbol totalRevenue
0 0 2010-01-04 2010-01-01 6.424606 7.643214 7.660714 7.585000 7.622500 493729600 0.000000 ... 216.687 NaT NaT NaN NaN NaN NaN NaN NaN NaN
1 1 2010-01-05 2010-01-01 6.435713 7.656429 7.699643 7.616071 7.664286 601904800 0.001729 ... 216.687 NaT NaT NaN NaN NaN NaN NaN NaN NaN
2 2 2010-01-06 2010-01-01 6.333344 7.534643 7.686786 7.526786 7.656429 552160000 -0.015906 ... 216.687 NaT NaT NaN NaN NaN NaN NaN NaN NaN
3 3 2010-01-07 2010-01-01 6.321636 7.520714 7.571429 7.466071 7.562500 477131200 -0.001849 ... 216.687 NaT NaT NaN NaN NaN NaN NaN NaN NaN
4 4 2010-01-08 2010-01-01 6.363664 7.570714 7.571429 7.466429 7.510714 447610800 0.006648 ... 216.687 NaT NaT NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3969 3969 2025-10-14 2025-10-01 247.770004 247.770004 248.850006 244.699997 246.600006 35478000 0.000444 ... NaN NaT NaT NaN NaN NaN NaN NaN NaN NaN
3970 3970 2025-10-15 2025-10-01 249.339996 249.339996 251.820007 247.470001 249.490005 33893600 0.006336 ... NaN NaT NaT NaN NaN NaN NaN NaN NaN NaN
3971 3971 2025-10-16 2025-10-01 247.449997 247.449997 249.039993 245.130005 248.250000 39777000 -0.007580 ... NaN NaT NaT NaN NaN NaN NaN NaN NaN NaN
3972 3972 2025-10-17 2025-10-01 252.289993 252.289993 253.380005 247.270004 248.020004 49147000 0.019559 ... NaN NaT NaT NaN NaN NaN NaN NaN NaN NaN
3973 3973 2025-10-20 2025-10-01 262.239990 262.239990 264.380005 255.630005 255.889999 90370300 0.039439 ... NaN NaT NaT NaN NaN NaN NaN NaN NaN NaN

3974 rows × 45 columns

In [9]:
# Make sure Date is datetime
dataset['Date'] = pd.to_datetime(dataset['Date'])

fig, ax1 = plt.subplots(figsize=(10, 6))

# Left y-axis: Adj Close
ax1.plot(dataset['Date'], dataset['Adj Close'], color='blue', label='Adj Close')
ax1.set_xlabel('Date')
ax1.set_ylabel('Adj Close', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

# Right y-axis: only non-NaN surprisePercentage values
ax2 = ax1.twinx()
mask = dataset['surprisePercentage'].notna()
ax2.plot(dataset.loc[mask, 'Date'],
         dataset.loc[mask, 'surprisePercentage'],
         color='red', marker='o', linestyle='-', linewidth=2, label='Surprise %')
ax2.set_ylabel('Surprise %', color='red')
ax2.tick_params(axis='y', labelcolor='red')

plt.title('Adj Close vs Surprise Percentage (Only Available Dates)')
lines, labels = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines + lines2, labels + labels2, loc='upper left')
plt.tight_layout()
plt.show()
No description has been provided for this image
In [13]:
mask = dataset['surprisePercentage'].notna() & dataset['APPL - 3 Day Drift Change - abnormal'].notna()
compare_df = dataset.loc[mask, ['Date', 'surprisePercentage', 'APPL - 3 Day Drift Change - abnormal']]
fig, ax1 = plt.subplots(figsize=(10,6))

ax1.plot(compare_df['Date'], compare_df['surprisePercentage'], color='blue', marker='o', label='Surprise %')
ax1.set_xlabel('Date')
ax1.set_ylabel('Surprise %', color='blue')
ax1.tick_params(axis='y', labelcolor='blue')

ax2 = ax1.twinx()
ax2.plot(compare_df['Date'], compare_df['APPL - 3 Day Drift Change - abnormal'], color='red', marker='x', label='3-Day Drift Change (Abnormal)')
ax2.set_ylabel('3-Day Drift Change (Abnormal)', color='red')
ax2.tick_params(axis='y', labelcolor='red')

plt.title('Surprise % vs 3-Day Drift Change (Abnormal)')
lines, labels = ax1.get_legend_handles_labels()
lines2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(lines + lines2, labels + labels2, loc='upper left')

plt.tight_layout()
plt.show()
No description has been provided for this image

Histograms for earnings percentages

In [4]:
appl_earnings = pd.read_csv('appl_earnings.csv')
googl_earnings = pd.read_csv('googl_earnings.csv')
nvda_earnings = pd.read_csv('nvda_earnings.csv')
In [5]:
import matplotlib.pyplot as plt

# Make the figure larger
plt.figure(figsize=(12, 4))

# Apple
plt.subplot(1, 3, 1)
plt.hist(appl_earnings['surprisePercentage'], bins=15, color='skyblue', edgecolor='black')
plt.title('AAPL: Surprise %')
plt.xlabel('Surprise Percentage')
plt.ylabel('Frequency')

# Google
plt.subplot(1, 3, 2)
plt.hist(googl_earnings['surprisePercentage'], bins=15, color='lightgreen', edgecolor='black')
plt.title('GOOGL: Surprise %')
plt.xlabel('Surprise Percentage')

# Nvidia
plt.subplot(1, 3, 3)
plt.hist(nvda_earnings['surprisePercentage'], bins=15, color='salmon', edgecolor='black')
plt.title('NVDA: Surprise %')
plt.xlabel('Surprise Percentage')

plt.tight_layout()
plt.show()
No description has been provided for this image
In [14]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import gaussian_kde

# Clean numeric data
appl = pd.to_numeric(appl_earnings['surprisePercentage'], errors='coerce').dropna()
googl = pd.to_numeric(googl_earnings['surprisePercentage'], errors='coerce').dropna()
nvda = pd.to_numeric(nvda_earnings['surprisePercentage'], errors='coerce').dropna()

# Combine all data for shared bins
all_data = np.concatenate([appl, googl, nvda])

# More bins (e.g., 30)
if all_data.min() == all_data.max():
    bins = 10
else:
    bins = np.linspace(all_data.min(), all_data.max(), 31)  # 30 bins

plt.figure(figsize=(12, 7))

# Histograms (shared bins)
plt.hist(appl, bins=bins, alpha=0.4, label='AAPL', edgecolor='black')
plt.hist(googl, bins=bins, alpha=0.4, label='GOOGL', edgecolor='black')
plt.hist(nvda, bins=bins, alpha=0.4, label='NVDA', edgecolor='black')

# KDE Trend Lines
xs = np.linspace(all_data.min(), all_data.max(), 400)

appl_kde = gaussian_kde(appl)
googl_kde = gaussian_kde(googl)
nvda_kde = gaussian_kde(nvda)

plt.plot(xs, appl_kde(xs) * len(appl) * (bins[1] - bins[0]), label='AAPL Trend')
plt.plot(xs, googl_kde(xs) * len(googl) * (bins[1] - bins[0]), label='GOOGL Trend')
plt.plot(xs, nvda_kde(xs) * len(nvda) * (bins[1] - bins[0]), label='NVDA Trend')

# Title & labels
plt.title('Surprise % Distribution with KDE Trend Lines')
plt.xlabel('Surprise Percentage')
plt.ylabel('Frequency')
plt.legend()

plt.tight_layout()
plt.show()
No description has been provided for this image

Histograms for CAR values - Separate Windows

In [20]:
# EVENT STUDY - HISTOGRAMS BY TICKER AND WINDOW

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load the Excel file
file_path = "event_study (1).xlsx"

# Specify only the CAR sheets you need
sheets_to_load = [
    'CAR_(0,1)',
    'CAR_(0,3)',
    'CAR_(0,5)',
    'CAR_(-1,+1)',
    'CAR_(-1,+5)_ROBUST'
]

# Read those sheets into a dictionary of DataFrames
car_sheets = pd.read_excel(file_path, sheet_name=sheets_to_load)

print("Sheets loaded:", list(car_sheets.keys()))


# 2. Group by ticker and view summary statistics
summary_stats = {}

for sheet_name, df in car_sheets.items():
    # Make sure expected columns exist
    if 'ticker' not in df.columns or 'CAR' not in df.columns:
        print(f"⚠️  Skipping {sheet_name} (missing columns)")
        continue

    grouped = df.groupby('ticker')['CAR'].describe()
    summary_stats[sheet_name] = grouped

    print(f"\n=== {sheet_name} ===")
    print(grouped)


# 3. Plot histograms by ticker (separate per CAR window)
for sheet_name, df in car_sheets.items():
    if 'ticker' not in df.columns or 'CAR' not in df.columns:
        continue

    print(f"\n📊 Plotting {sheet_name}...")
    tickers = df['ticker'].unique()

    # One histogram per ticker
    for tkr in tickers:
        subset = df[df['ticker'] == tkr]

        plt.figure(figsize=(6, 4))
        plt.hist(subset['CAR'], bins=15, color='skyblue', edgecolor='black')
        plt.title(f"{sheet_name} — {tkr}")
        plt.xlabel("CAR")
        plt.ylabel("Frequency")
        plt.grid(alpha=0.3)
        plt.tight_layout()
        plt.show()


# 4. Combined histogram per CAR window (tickers overlaid)
for sheet_name, df in car_sheets.items():
    if 'ticker' not in df.columns or 'CAR' not in df.columns:
        continue

    plt.figure(figsize=(8, 5))
    sns.histplot(data=df, x='CAR', hue='ticker', bins=20, kde=True, element='step')
    plt.title(f"Distribution of CAR by Ticker — {sheet_name}")
    plt.xlabel("CAR")
    plt.ylabel("Frequency")
    plt.legend(title='Ticker')
    plt.tight_layout()
    plt.show()


# 5. (Optional) Save grouped summaries to Excel
with pd.ExcelWriter("CAR_ticker_summaries.xlsx") as writer:
    for sheet_name, grouped in summary_stats.items():
        grouped.to_excel(writer, sheet_name=sheet_name)

print("\n✅ Analysis complete — histograms displayed and summary file saved as 'CAR_ticker_summaries.xlsx'")
Sheets loaded: ['CAR_(0,1)', 'CAR_(0,3)', 'CAR_(0,5)', 'CAR_(-1,+1)', 'CAR_(-1,+5)_ROBUST']

=== CAR_(0,1) ===
        count      mean       std       min       25%       50%       75%  \
ticker                                                                      
AAPL     43.0  0.005428  0.046335 -0.089698 -0.024862  0.011972  0.035717   
GOOGL    43.0  0.003760  0.052160 -0.093655 -0.034078  0.005565  0.038648   
NVDA     43.0  0.026321  0.096231 -0.262444 -0.027790  0.000834  0.087351   

             max  
ticker            
AAPL    0.103840  
GOOGL   0.142875  
NVDA    0.242677  

=== CAR_(0,3) ===
        count      mean       std       min       25%       50%       75%  \
ticker                                                                      
AAPL     43.0  0.006503  0.052120 -0.094895 -0.024847  0.018516  0.042477   
GOOGL    43.0  0.000413  0.058373 -0.106452 -0.044546  0.001775  0.038365   
NVDA     43.0  0.028975  0.107400 -0.232482 -0.047788  0.011799  0.109112   

             max  
ticker            
AAPL    0.107871  
GOOGL   0.161237  
NVDA    0.315025  

=== CAR_(0,5) ===
        count      mean       std       min       25%       50%       75%  \
ticker                                                                      
AAPL     43.0  0.009665  0.056591 -0.102176 -0.031273  0.018055  0.049417   
GOOGL    43.0 -0.002316  0.057561 -0.122602 -0.043850 -0.007569  0.038725   
NVDA     43.0  0.024691  0.106726 -0.194454 -0.057976  0.001781  0.102786   

             max  
ticker            
AAPL    0.119699  
GOOGL   0.118061  
NVDA    0.321600  

=== CAR_(-1,+1) ===
        count      mean       std       min       25%       50%       75%  \
ticker                                                                      
AAPL     43.0  0.007897  0.044772 -0.085606 -0.018616  0.015868  0.036014   
GOOGL    43.0  0.007233  0.051769 -0.097168 -0.030333  0.004781  0.039020   
NVDA     43.0  0.023750  0.094759 -0.260275 -0.030462  0.013586  0.077743   

             max  
ticker            
AAPL    0.113222  
GOOGL   0.162359  
NVDA    0.222872  

=== CAR_(-1,+5)_ROBUST ===
        count      mean       std       min       25%       50%       75%  \
ticker                                                                      
AAPL     43.0  0.012134  0.055863 -0.093473 -0.030653  0.020592  0.055921   
GOOGL    43.0  0.001158  0.055660 -0.126114 -0.037115  0.001218  0.037533   
NVDA     43.0  0.022120  0.105125 -0.192285 -0.046309  0.007665  0.084850   

             max  
ticker            
AAPL    0.129082  
GOOGL   0.137545  
NVDA    0.294146  

📊 Plotting CAR_(0,1)...
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
📊 Plotting CAR_(0,3)...
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
📊 Plotting CAR_(0,5)...
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
📊 Plotting CAR_(-1,+1)...
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
📊 Plotting CAR_(-1,+5)_ROBUST...
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
C:\Users\aledr\AppData\Local\Temp\ipykernel_17232\2340267492.py:83: UserWarning: No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
  plt.legend(title='Ticker')
No description has been provided for this image
C:\Users\aledr\AppData\Local\Temp\ipykernel_17232\2340267492.py:83: UserWarning: No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
  plt.legend(title='Ticker')
No description has been provided for this image
C:\Users\aledr\AppData\Local\Temp\ipykernel_17232\2340267492.py:83: UserWarning: No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
  plt.legend(title='Ticker')
No description has been provided for this image
C:\Users\aledr\AppData\Local\Temp\ipykernel_17232\2340267492.py:83: UserWarning: No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
  plt.legend(title='Ticker')
No description has been provided for this image
C:\Users\aledr\AppData\Local\Temp\ipykernel_17232\2340267492.py:83: UserWarning: No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
  plt.legend(title='Ticker')
No description has been provided for this image
✅ Analysis complete — histograms displayed and summary file saved as 'CAR_ticker_summaries.xlsx'
In [22]:
import matplotlib.pyplot as plt
import seaborn as sns

for sheet_name, df in car_sheets.items():
    if 'ticker' not in df.columns or 'CAR' not in df.columns:
        continue

    # Drop rows with missing ticker or CAR values
    df = df.dropna(subset=['ticker', 'CAR']).copy()

    # Ensure ticker is a proper string (so Seaborn recognizes it as categorical)
    df['ticker'] = df['ticker'].astype(str).str.strip()

    plt.figure(figsize=(8, 5))
    ax = sns.histplot(
        data=df,
        x='CAR',
        hue='ticker',
        bins=20,
        kde=True,
        element='step',
        alpha=0.5
    )

    # Force the legend to show actual labels
    handles, labels = ax.get_legend_handles_labels()

    # If seaborn doesn’t pick up the labels, rebuild them manually
    if not labels or labels == ['Ticker']:
        unique_tickers = sorted(df['ticker'].unique())
        handles = [plt.Line2D([0], [0], color=c, lw=4) for c in sns.color_palette(n_colors=len(unique_tickers))]
        labels = unique_tickers

    plt.legend(handles, labels, title='Ticker', title_fontsize=11, fontsize=10, loc='upper right')

    plt.title(f"Distribution of CAR by Ticker — {sheet_name}")
    plt.xlabel("CAR")
    plt.ylabel("Frequency")
    plt.tight_layout()
    plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

Share price changes Vs Markets

In [23]:
SandP_vs_share_prices = pd.read_csv('SandP - Stock Changes.csv')
In [24]:
import matplotlib.pyplot as plt

# Set up the figure and 3 subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharex=True, sharey=True)

# Scatter 1: NVDA vs S&P
axes[0].scatter(SandP_vs_share_prices['S&P_Change'], SandP_vs_share_prices['nvda_change'],
                alpha=0.6, color='green', edgecolor='black')
axes[0].set_title('S&P 500 vs NVIDIA')
axes[0].set_xlabel('S&P_Change')
axes[0].set_ylabel('nvda_change')
axes[0].grid(alpha=0.3)

# Scatter 2: AAPL vs S&P
axes[1].scatter(SandP_vs_share_prices['S&P_Change'], SandP_vs_share_prices['appl_change'],
                alpha=0.6, color='blue', edgecolor='black')
axes[1].set_title('S&P 500 vs Apple')
axes[1].set_xlabel('S&P_Change')
axes[1].set_ylabel('appl_change')
axes[1].grid(alpha=0.3)

# Scatter 3: GOOGL vs S&P
axes[2].scatter(SandP_vs_share_prices['S&P_Change'], SandP_vs_share_prices['goog_change'],
                alpha=0.6, color='orange', edgecolor='black')
axes[2].set_title('S&P 500 vs Google')
axes[2].set_xlabel('S&P_Change')
axes[2].set_ylabel('goog_change')
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.show()
No description has been provided for this image
In [25]:
import pandas as pd
import statsmodels.api as sm

# Define your dataframe
df = SandP_vs_share_prices.copy()

# Define dependent variables (the three stocks)
stocks = ['nvda_change', 'appl_change', 'goog_change']

# Loop through each stock and run a regression vs. S&P_Change
for stock in stocks:
    print(f"\n=== Linear Regression: {stock} vs S&P_Change ===")

    # Drop missing values for the two relevant columns
    data = df[['S&P_Change', stock]].dropna()

    # Define X (independent variable) and y (dependent)
    X = sm.add_constant(data['S&P_Change'])  # adds intercept (alpha)
    y = data[stock]

    # Run Ordinary Least Squares regression
    model = sm.OLS(y, X).fit()

    # Print summary
    print(model.summary())

    # Extract key results
    alpha = model.params['const']
    beta = model.params['S&P_Change']
    r2 = model.rsquared
    print(f"Alpha: {alpha:.4f} | Beta: {beta:.4f} | R²: {r2:.4f}")
=== Linear Regression: nvda_change vs S&P_Change ===
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            nvda_change   R-squared:                       0.399
Model:                            OLS   Adj. R-squared:                  0.399
Method:                 Least Squares   F-statistic:                     2635.
Date:                Tue, 11 Nov 2025   Prob (F-statistic):               0.00
Time:                        12:28:00   Log-Likelihood:                 9460.4
No. Observations:                3973   AIC:                        -1.892e+04
Df Residuals:                    3971   BIC:                        -1.890e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0011      0.000      3.078      0.002       0.000       0.002
S&P_Change     1.6635      0.032     51.328      0.000       1.600       1.727
==============================================================================
Omnibus:                     1584.320   Durbin-Watson:                   2.039
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            50178.243
Skew:                           1.265   Prob(JB):                         0.00
Kurtosis:                      20.225   Cond. No.                         91.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Alpha: 0.0011 | Beta: 1.6635 | R²: 0.3988

=== Linear Regression: appl_change vs S&P_Change ===
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            appl_change   R-squared:                       0.477
Model:                            OLS   Adj. R-squared:                  0.477
Method:                 Least Squares   F-statistic:                     3621.
Date:                Tue, 11 Nov 2025   Prob (F-statistic):               0.00
Time:                        12:28:00   Log-Likelihood:                 11649.
No. Observations:                3973   AIC:                        -2.329e+04
Df Residuals:                    3971   BIC:                        -2.328e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0005      0.000      2.543      0.011       0.000       0.001
S&P_Change     1.1242      0.019     60.178      0.000       1.088       1.161
==============================================================================
Omnibus:                      549.580   Durbin-Watson:                   1.890
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             7460.253
Skew:                           0.047   Prob(JB):                         0.00
Kurtosis:                       9.712   Cond. No.                         91.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Alpha: 0.0005 | Beta: 1.1242 | R²: 0.4770

=== Linear Regression: goog_change vs S&P_Change ===
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            goog_change   R-squared:                       0.469
Model:                            OLS   Adj. R-squared:                  0.469
Method:                 Least Squares   F-statistic:                     3504.
Date:                Tue, 11 Nov 2025   Prob (F-statistic):               0.00
Time:                        12:28:00   Log-Likelihood:                 11717.
No. Observations:                3973   AIC:                        -2.343e+04
Df Residuals:                    3971   BIC:                        -2.342e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0003      0.000      1.512      0.131   -9.04e-05       0.001
S&P_Change     1.0873      0.018     59.197      0.000       1.051       1.123
==============================================================================
Omnibus:                     1485.661   Durbin-Watson:                   1.937
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            64397.851
Skew:                           1.055   Prob(JB):                         0.00
Kurtosis:                      22.610   Cond. No.                         91.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Alpha: 0.0003 | Beta: 1.0873 | R²: 0.4688

Linear and mulilinear regressions using CAR values

In [4]:
import pandas as pd
import numpy as np
from pathlib import Path

# Use statsmodels for regression with standard errors and probability values
import statsmodels.api as statsmodels_api
In [16]:
# Step 1: list sheet names so you can target the right ones
event_book = pd.ExcelFile("event_study.xlsx")
feature_book = pd.ExcelFile("features_patched.xlsx")

print("Event sheets:")
print(event_book.sheet_names)
print("\nFeature sheets:")
print(feature_book.sheet_names)
Event sheets:
['README', 'CAR_(0,1)', 'CAR_(0,3)', 'CAR_(0,5)', 'CAR_(-1,+1)', 'CAR_(-1,+5)_ROBUST', 'CAAR_Summary', 'AlphaBeta_Params']

Feature sheets:
['Sheet1', 'features']
In [18]:
# Step 2: choose the sheets
# Event windows are every sheet that starts with "CAR"
event_window_sheets = [s for s in event_book.sheet_names if str(s).upper().startswith("CAR")]

# The features live in a sheet named "features"
features_sheet = "features"
In [20]:
# Step 3: read the features sheet once
features_table = pd.read_excel("features_patched.xlsx", sheet_name=features_sheet)
In [26]:
# Step 4: set join keys and predictor
join_keys = ["ticker", "announce_date", "timing", "day0"]
predictor_col = "gap_proxy_dm1_to_d0" 

# basic checks
missing_in_features = [c for c in join_keys if c not in features_table.columns]
if missing_in_features:
    raise ValueError(f"Join keys missing in features sheet: {missing_in_features}")
if predictor_col not in features_table.columns:
    raise ValueError(f"Missing predictor column: {predictor_col}")
In [28]:
# Step 5: helper to run one linear regression
import statsmodels.api as sm
def run_regression(y_series, x_series):
    clean = pd.DataFrame({"y": y_series, "x": x_series}).dropna()
    if len(clean) < 3:
        return None
    X = sm.add_constant(clean["x"])
    model = sm.OLS(clean["y"], X).fit()
    return {
        "intercept": float(model.params.get("const", np.nan)),
        "slope_on_gap_proxy_dm1_to_d0": float(model.params.get("x", np.nan)),
        "r_squared": float(model.rsquared),
        "p_value_for_slope": float(model.pvalues.get("x", np.nan)),
        "std_error_for_slope": float(model.bse.get("x", np.nan)),
        "rows_used": int(model.nobs),
    }
In [30]:
# Step 6: loop windows, merge, regress
results = []
skipped = []

for sheet in event_window_sheets:
    event_table = pd.read_excel("event_study.xlsx", sheet_name=sheet)

    # must have the join keys and a CAR column
    miss_event = [c for c in join_keys if c not in event_table.columns]
    if miss_event:
        skipped.append({"window_sheet": sheet, "reason": f"Missing join keys: {miss_event}"})
        continue
    if "CAR" not in event_table.columns:
        skipped.append({"window_sheet": sheet, "reason": "No CAR column"})
        continue

    # merge
    merged = pd.merge(
        event_table[join_keys + ["CAR"]],
        features_table[join_keys + [predictor_col]],
        on=join_keys,
        how="inner"
    )
    if merged.empty:
        skipped.append({"window_sheet": sheet, "reason": "Merge produced zero rows"})
        continue

    # regress CAR on gap
    out = run_regression(merged["CAR"], merged[predictor_col])
    if out is None:
        skipped.append({"window_sheet": sheet, "reason": "Too few rows after dropping missing values"})
        continue

    out["window_sheet"] = sheet
    out["car_column"] = "CAR"
    results.append(out)

results_table = pd.DataFrame(results).sort_values("window_sheet")
skipped_table = pd.DataFrame(skipped)

print("Results:")
display(results_table)
print("\nSkipped:")
display(skipped_table)
Results:
intercept slope_on_gap_proxy_dm1_to_d0 r_squared p_value_for_slope std_error_for_slope rows_used window_sheet car_column
3 -0.004615 0.946683 0.702821 2.890488e-35 0.054625 129 CAR_(-1,+1) CAR
4 -0.006355 0.978147 0.593682 1.327799e-26 0.071806 129 CAR_(-1,+5)_ROBUST CAR
0 -0.006765 1.001988 0.754168 1.640008e-40 0.050763 129 CAR_(0,1) CAR
1 -0.007769 1.062884 0.676748 6.140477e-33 0.065184 129 CAR_(0,3) CAR
2 -0.008506 1.033452 0.634210 1.627027e-29 0.069645 129 CAR_(0,5) CAR
Skipped:
In [35]:
# Linear regressions of CAR on gap_proxy_dm1_to_d0
# Run a separate regression for each ticker inside each CAR window.

import pandas as pd
import numpy as np
from pathlib import Path
import statsmodels.api as sm

# ----- Files -----
event_file = "event_study.xlsx"
features_file = "features_patched.xlsx"

# ----- Discover sheets -----
event_book = pd.ExcelFile(event_file)
feature_book = pd.ExcelFile(features_file)

# Pick every event window sheet that starts with "CAR"
event_window_sheets = [s for s in event_book.sheet_names if str(s).upper().startswith("CAR")]

# The features are in the "features" sheet
features_sheet = "features"

# ----- Join keys and predictor -----
join_keys = ["ticker", "announce_date", "timing", "day0"]
predictor_col = "gap_proxy_dm1_to_d0"

# ----- Load features once -----
features_table = pd.read_excel(features_file, sheet_name=features_sheet)

# Basic checks
missing_in_features = [c for c in join_keys if c not in features_table.columns]
if missing_in_features:
    raise ValueError(f"Join keys missing in features sheet: {missing_in_features}")
if predictor_col not in features_table.columns:
    raise ValueError(f"Missing predictor column in features sheet: {predictor_col}")

# ----- Helper: run one regression -----
def run_regression(y, x):
    frame = pd.DataFrame({"y": y, "x": x}).dropna()
    if len(frame) < 3:
        return None
    X = sm.add_constant(frame["x"])
    model = sm.OLS(frame["y"], X).fit()
    return {
        "intercept": float(model.params.get("const", np.nan)),
        "slope_on_gap_proxy_dm1_to_d0": float(model.params.get("x", np.nan)),
        "r_squared": float(model.rsquared),
        "p_value_for_slope": float(model.pvalues.get("x", np.nan)),
        "std_error_for_slope": float(model.bse.get("x", np.nan)),
        "rows_used": int(model.nobs),
    }

# ----- Loop windows, then tickers -----
all_rows = []
skipped = []

for window_sheet in event_window_sheets:
    event_table = pd.read_excel(event_file, sheet_name=window_sheet)

    # Must have the join keys and a CAR column
    missing_in_event = [c for c in join_keys if c not in event_table.columns]
    if missing_in_event:
        skipped.append({"window_sheet": window_sheet, "ticker": None,
                        "reason": f"Missing join keys in event sheet: {missing_in_event}"})
        continue
    if "CAR" not in event_table.columns:
        skipped.append({"window_sheet": window_sheet, "ticker": None,
                        "reason": "No CAR column in event sheet"})
        continue

    # Merge event rows with features on the keys
    merged = pd.merge(
        event_table[join_keys + ["CAR"]],
        features_table[join_keys + [predictor_col]],
        on=join_keys,
        how="inner"
    )

    if merged.empty:
        skipped.append({"window_sheet": window_sheet, "ticker": None,
                        "reason": "Merge produced zero rows"})
        continue

    # Group by ticker inside this window
    for ticker, grp in merged.groupby("ticker", dropna=False):
        out = run_regression(grp["CAR"], grp[predictor_col])
        if out is None:
            skipped.append({"window_sheet": window_sheet, "ticker": ticker,
                            "reason": "Too few rows after removing missing values"})
            continue
        out["window_sheet"] = window_sheet
        out["ticker"] = ticker
        out["car_column"] = "CAR"
        all_rows.append(out)

# ----- Build tables (robust to empty lists) -----
results_by_ticker = pd.DataFrame(all_rows)
if not results_by_ticker.empty:
    results_by_ticker = results_by_ticker.sort_values(
        ["window_sheet", "ticker"]
    ).reset_index(drop=True)
else:
    # create an empty frame with the expected columns
    results_by_ticker = pd.DataFrame(columns=[
        "window_sheet","ticker","car_column",
        "intercept","slope_on_gap_proxy_dm1_to_d0",
        "r_squared","p_value_for_slope","std_error_for_slope","rows_used"
    ])

skipped_table = pd.DataFrame(skipped)
if skipped_table.empty:
    # nothing was skipped
    skipped_table = pd.DataFrame(columns=["window_sheet","ticker","reason"])
else:
    # make sure the sort keys exist even if some dicts missed them
    for col in ["window_sheet", "ticker"]:
        if col not in skipped_table.columns:
            skipped_table[col] = pd.NA
    skipped_table = skipped_table.sort_values(
        ["window_sheet", "ticker"], na_position="last"
    ).reset_index(drop=True)

print("Results (first rows):")
display(results_by_ticker.head(20))

print("\nSkipped (first rows):")
display(skipped_table.head(20))
Results (first rows):
intercept slope_on_gap_proxy_dm1_to_d0 r_squared p_value_for_slope std_error_for_slope rows_used window_sheet ticker car_column
0 0.001128 0.850779 0.608825 6.886153e-10 0.106503 43 CAR_(-1,+1) AAPL CAR
1 -0.004444 0.780576 0.728623 3.517210e-13 0.074397 43 CAR_(-1,+1) GOOGL CAR
2 -0.011812 1.084953 0.737988 1.701914e-13 0.100961 43 CAR_(-1,+1) NVDA CAR
3 0.004423 0.969133 0.507463 8.430830e-08 0.149111 43 CAR_(-1,+5)_ROBUST AAPL CAR
4 -0.011178 0.824559 0.703350 2.219041e-12 0.083631 43 CAR_(-1,+5)_ROBUST GOOGL CAR
5 -0.013724 1.093544 0.609152 6.767499e-10 0.136800 43 CAR_(-1,+5)_ROBUST NVDA CAR
6 -0.002004 0.934096 0.685250 7.564049e-12 0.098869 43 CAR_(0,1) AAPL CAR
7 -0.008858 0.843446 0.838031 8.369310e-18 0.057910 43 CAR_(0,1) GOOGL CAR
8 -0.010297 1.117168 0.758700 3.105621e-14 0.098394 43 CAR_(0,1) NVDA CAR
9 -0.001606 1.019201 0.644760 9.300364e-11 0.118149 43 CAR_(0,3) AAPL CAR
10 -0.013247 0.913102 0.784202 3.096975e-15 0.074806 43 CAR_(0,3) GOOGL CAR
11 -0.009129 1.162476 0.659518 3.856087e-11 0.130444 43 CAR_(0,3) NVDA CAR
12 0.001291 1.052451 0.583153 2.585735e-09 0.138966 43 CAR_(0,5) AAPL CAR
13 -0.015592 0.887429 0.761765 2.384949e-14 0.077506 43 CAR_(0,5) GOOGL CAR
14 -0.012209 1.125759 0.626345 2.656516e-10 0.135795 43 CAR_(0,5) NVDA CAR
Skipped (first rows):
window_sheet ticker reason
In [37]:
# window_sheet, ticker, slope_on_gap_proxy_dm1_to_d0, std_error_for_slope, p_value_for_slope

import os
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

required_cols = {
    "window_sheet",
    "ticker",
    "slope_on_gap_proxy_dm1_to_d0",
    "std_error_for_slope",
    "p_value_for_slope",
}
missing = required_cols - set(results_by_ticker.columns)
if missing:
    raise ValueError(f"Missing columns in results_by_ticker: {missing}")

# Clean copy
res = results_by_ticker.copy()

# Make a folder for pictures
os.makedirs("figures", exist_ok=True)
In [39]:
import statsmodels.api as sm

def plot_scatter_for(window_name, ticker_name,
                     event_file="event_study.xlsx",
                     features_file="features_patched.xlsx",
                     features_sheet="features",
                     join_keys=("ticker","announce_date","timing","day0"),
                     predictor_col="gap_proxy_dm1_to_d0"):
    # Load the two tables
    event = pd.read_excel(event_file, sheet_name=window_name)
    feat = pd.read_excel(features_file, sheet_name=features_sheet)

    # Filter to the ticker
    event = event[event["ticker"] == ticker_name]
    feat = feat[feat["ticker"] == ticker_name]

    # Keep only needed columns
    event_use = event[list(join_keys) + ["CAR"]]
    feat_use = feat[list(join_keys) + [predictor_col]]

    merged = pd.merge(event_use, feat_use, on=list(join_keys), how="inner").dropna(subset=["CAR", predictor_col])
    if merged.empty:
        print("No merged rows for that pair.")
        return

    # Fit a line for the label
    X = sm.add_constant(merged[predictor_col])
    model = sm.OLS(merged["CAR"], X).fit()
    slope = model.params.get(predictor_col, np.nan)
    pval = model.pvalues.get(predictor_col, np.nan)

    # Build the plot
    plt.figure(figsize=(7, 5))
    plt.scatter(merged[predictor_col], merged["CAR"])
    # Draw the fitted line
    x_line = np.linspace(merged[predictor_col].min(), merged[predictor_col].max(), 100)
    y_line = model.params.get("const", 0.0) + slope * x_line
    plt.plot(x_line, y_line)

    plt.xlabel("gap_proxy_dm1_to_d0")
    plt.ylabel("CAR")
    plt.title(f"{ticker_name} — {window_name}\nSlope: {slope:.4g} | p value: {pval:.3g}")
    plt.tight_layout()
    out = f"figures/scatter_{window_name}_{ticker_name}.png".replace(" ", "_")
    plt.savefig(out, dpi=150)
    plt.show()
    print(f"Rows used: {int(model.nobs)}   Saved: {out}")
In [41]:
plot_scatter_for("CAR_(0,1)", "AAPL")  # change the ticker to one you hold
No description has been provided for this image
Rows used: 43   Saved: figures/scatter_CAR_(0,1)_AAPL.png
In [43]:
plot_scatter_for("CAR_(0,1)", "GOOGL")  # change the ticker to one you hold
No description has been provided for this image
Rows used: 43   Saved: figures/scatter_CAR_(0,1)_GOOGL.png
In [45]:
plot_scatter_for("CAR_(0,1)", "NVDA")  # change the ticker to one you hold
No description has been provided for this image
Rows used: 43   Saved: figures/scatter_CAR_(0,1)_NVDA.png
In [49]:
plot_scatter_for("CAR_(-1,+1)", "AAPL")  # change the ticker to one you hold
No description has been provided for this image
Rows used: 43   Saved: figures/scatter_CAR_(-1,+1)_AAPL.png
In [51]:
plot_scatter_for("CAR_(-1,+1)", "GOOGL")  # change the ticker to one you hold
No description has been provided for this image
Rows used: 43   Saved: figures/scatter_CAR_(-1,+1)_GOOGL.png
In [53]:
plot_scatter_for("CAR_(-1,+1)", "AAPL")  # change the ticker to one you hold
No description has been provided for this image
Rows used: 43   Saved: figures/scatter_CAR_(-1,+1)_NVDA.png
In [57]:
plot_scatter_for("CAR_(-1,+5)_ROBUST", "AAPL")  # change the ticker to one you hold
No description has been provided for this image
Rows used: 43   Saved: figures/scatter_CAR_(-1,+5)_ROBUST_AAPL.png
In [59]:
plot_scatter_for("CAR_(-1,+5)_ROBUST", "GOOGL")  # change the ticker to one you hold
No description has been provided for this image
Rows used: 43   Saved: figures/scatter_CAR_(-1,+5)_ROBUST_GOOGL.png
In [61]:
plot_scatter_for("CAR_(-1,+5)_ROBUST", "NVDA")  # change the ticker to one you hold
No description has been provided for this image
Rows used: 43   Saved: figures/scatter_CAR_(-1,+5)_ROBUST_NVDA.png
In [64]:
# Multiple linear regression for AAPL, window CAR_(0,1)
# Drivers: gap_proxy_dm1_to_d0, vix_chg_5d_lag1, pre_vol_5d

import pandas as pd
import numpy as np
from pathlib import Path
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# ----- Settings -----
event_file = "event_study.xlsx"
features_file = "features_patched.xlsx"
event_sheet = "CAR_(0,1)"
features_sheet = "features"
join_keys = ["ticker", "announce_date", "timing", "day0"]
target_col = "CAR"
ticker_filter = "AAPL"
predictors = ["gap_proxy_dm1_to_d0", "vix_chg_5d_lag1", "pre_vol_5d"]

# ----- Load -----
ev = pd.read_excel(event_file, sheet_name=event_sheet)
ft = pd.read_excel(features_file, sheet_name=features_sheet)

# Basic checks
need_ev = set(join_keys + [target_col])
need_ft = set(join_keys + predictors)
missing_ev = [c for c in need_ev if c not in ev.columns]
missing_ft = [c for c in need_ft if c not in ft.columns]
if missing_ev:
    raise ValueError(f"Event sheet is missing: {missing_ev}")
if missing_ft:
    raise ValueError(f"Features sheet is missing: {missing_ft}")

# ----- Merge and filter to ticker -----
merged = (
    pd.merge(
        ev[join_keys + [target_col]],
        ft[join_keys + predictors],
        on=join_keys,
        how="inner"
    )
    .query("ticker == @ticker_filter")
    .dropna(subset=[target_col] + predictors)
    .copy()
)

n_rows = len(merged)
print(f"Rows for {ticker_filter} in {event_sheet}: {n_rows}")
if n_rows < 10:
    print("Warning: very few rows. Treat the result as weak.")

# ----- Build design matrices -----
X = merged[predictors]
X = sm.add_constant(X)  # add intercept
y = merged[target_col]

# ----- Fit model (ordinary) -----
ols = sm.OLS(y, X).fit()

# ----- Fit model (heteroskedasticity-robust, HC3) -----
ols_hc3 = sm.OLS(y, X).fit(cov_type="HC3")

# ----- Tidy tables -----
def tidy_result(res):
    coefs = res.params
    ses = res.bse
    tvals = res.tvalues
    pvals = res.pvalues
    out = pd.DataFrame({
        "term": coefs.index,
        "coefficient": coefs.values,
        "standard_error": ses.values,
        "t_value": tvals.values,
        "p_value": pvals.values
    })
    return out

tidy_ordinary = tidy_result(ols)
tidy_robust = tidy_result(ols_hc3)

# ----- R squared and sample info -----
metrics = pd.DataFrame([{
    "r_squared": float(ols.rsquared),
    "r_squared_adjusted": float(ols.rsquared_adj),
    "r_squared_robust_same_fit": float(ols_hc3.rsquared),  # same fit, different errors
    "observations": int(ols.nobs)
}])

# ----- Multicollinearity check (VIF) -----
# Drop the constant for VIF calculation and rebuild array with constant first column
X_no_const = merged[predictors]
X_vif = np.column_stack([np.ones(len(X_no_const))] + [X_no_const[c].values for c in predictors])
vif_rows = []
for i, name in enumerate(["const"] + predictors):
    try:
        vif_val = variance_inflation_factor(X_vif, i)
    except Exception:
        vif_val = np.nan
    vif_rows.append({"term": name, "vif": float(vif_val)})
vif_table = pd.DataFrame(vif_rows)

# ----- Show outputs -----
print("\nCoefficients (ordinary errors):")
display(tidy_ordinary)

print("\nCoefficients (robust errors, HC3):")
display(tidy_robust)

print("\nModel metrics:")
display(metrics)

print("\nVariance inflation factors:")
display(vif_table)

# ----- Save to a file -----
out_path = Path(f"AAPL_CAR_0_1_MLR.xlsx")
with pd.ExcelWriter(out_path, engine="xlsxwriter") as writer:
    pd.DataFrame([{
        "event_sheet": event_sheet,
        "ticker": ticker_filter,
        "predictors": ", ".join(predictors)
    }]).to_excel(writer, sheet_name="meta", index=False)
    tidy_ordinary.to_excel(writer, sheet_name="coefficients_ordinary", index=False)
    tidy_robust.to_excel(writer, sheet_name="coefficients_robust", index=False)
    metrics.to_excel(writer, sheet_name="metrics", index=False)
    vif_table.to_excel(writer, sheet_name="vif", index=False)

print(f"\nSaved: {out_path.resolve()}")
Rows for AAPL in CAR_(0,1): 43

Coefficients (ordinary errors):
term coefficient standard_error t_value p_value
0 const -0.004600 0.008614 -0.534067 5.963268e-01
1 gap_proxy_dm1_to_d0 0.928265 0.096965 9.573204 8.671299e-12
2 vix_chg_5d_lag1 -0.043587 0.023180 -1.880392 6.753806e-02
3 pre_vol_5d 0.285117 0.540734 0.527278 6.009872e-01
Coefficients (robust errors, HC3):
term coefficient standard_error t_value p_value
0 const -0.004600 0.007579 -0.607009 5.438447e-01
1 gap_proxy_dm1_to_d0 0.928265 0.109294 8.493307 2.008396e-17
2 vix_chg_5d_lag1 -0.043587 0.024225 -1.799252 7.197884e-02
3 pre_vol_5d 0.285117 0.515844 0.552720 5.804554e-01
Model metrics:
r_squared r_squared_adjusted r_squared_robust_same_fit observations
0 0.712389 0.690265 0.712389 43
Variance inflation factors:
term vif
0 const 4.797689
1 gap_proxy_dm1_to_d0 1.001275
2 vix_chg_5d_lag1 1.007522
3 pre_vol_5d 1.006412
Saved: C:\Users\dcazo\Documents\AAPL_CAR_0_1_MLR.xlsx
In [66]:
# Visualise AAPL in window CAR_(0,1)
# Three scatter plots with best-fit lines (one per predictor)
# One observed vs predicted plot from the multiple linear regression

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as statsmodels_api

# ---- Settings you can change ----
event_file = "event_study.xlsx"
features_file = "features_patched.xlsx"
event_sheet = "CAR_(0,1)"
features_sheet = "features"
ticker_to_show = "AAPL"
join_keys = ["ticker", "announce_date", "timing", "day0"]
target_column = "CAR"
predictor_columns = ["gap_proxy_dm1_to_d0", "vix_chg_5d_lag1", "pre_vol_5d"]

# ---- Load and join ----
event_table = pd.read_excel(event_file, sheet_name=event_sheet)
features_table = pd.read_excel(features_file, sheet_name=features_sheet)

merged_data = pd.merge(
    event_table[join_keys + [target_column]],
    features_table[join_keys + predictor_columns],
    on=join_keys,
    how="inner",
)

merged_data = merged_data.loc[merged_data["ticker"] == ticker_to_show]
merged_data = merged_data.dropna(subset=[target_column] + predictor_columns).copy()

if merged_data.empty:
    raise ValueError("No rows found after merge and filter. Check the ticker, window, or column names.")

# ---- Helper: simple scatter with best-fit line y = a + b x ----
def scatter_with_fit(data_frame, x_name, y_name, title_text):
    x_values = data_frame[x_name].to_numpy()
    y_values = data_frame[y_name].to_numpy()

    X_design = statsmodels_api.add_constant(x_values)
    model = statsmodels_api.OLS(y_values, X_design).fit()
    intercept = float(model.params[0])
    slope = float(model.params[1])
    r_squared = float(model.rsquared)

    x_line = np.linspace(x_values.min(), x_values.max(), 100)
    y_line = intercept + slope * x_line

    plt.figure(figsize=(7, 5))
    plt.scatter(x_values, y_values)
    plt.plot(x_line, y_line)
    plt.xlabel(x_name)
    plt.ylabel(y_name)
    plt.title(f"{title_text}\nSlope: {slope:.4g}   Intercept: {intercept:.4g}   R squared: {r_squared:.3f}")
    plt.grid(True, linestyle="--", alpha=0.4)
    plt.tight_layout()
    plt.show()

# ---- Draw one plot per predictor ----
for predictor in predictor_columns:
    scatter_with_fit(
        merged_data,
        predictor,
        target_column,
        title_text=f"{ticker_to_show} — {event_sheet}",
    )

# ---- Observed vs predicted from the full multiple regression ----
X_full = statsmodels_api.add_constant(merged_data[predictor_columns])
y_full = merged_data[target_column].to_numpy()
mlr_model = statsmodels_api.OLS(y_full, X_full).fit()
y_pred = mlr_model.fittedvalues.to_numpy()
r2_full = float(mlr_model.rsquared)

# Best-fit line between predicted and observed (not forced to 45 degrees)
X_line = statsmodels_api.add_constant(y_pred)
line_model = statsmodels_api.OLS(y_full, X_line).fit()
line_intercept, line_slope = line_model.params
x_line = np.linspace(y_pred.min(), y_pred.max(), 100)
y_line = line_intercept + line_slope * x_line

plt.figure(figsize=(7, 5))
plt.scatter(y_pred, y_full)
plt.plot(x_line, y_line)
plt.plot(x_line, x_line, linestyle="--")  # 45 degree reference
plt.xlabel("Predicted CAR")
plt.ylabel("Observed CAR")
plt.title(f"{ticker_to_show} — {event_sheet}\nObserved vs Predicted   R squared: {r2_full:.3f}")
plt.grid(True, linestyle="--", alpha=0.4)
plt.tight_layout()
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [68]:
# Multiple linear regression for GOOGL, window CAR_(0,1)
# Drivers: gap_proxy_dm1_to_d0, pre_vol_5d, eps_surprise_pct
# This prints clean tables and draws plots inside the notebook. No files are written.

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt

# ----- Settings -----
event_file = "event_study.xlsx"
features_file = "features_patched.xlsx"
event_sheet = "CAR_(0,1)"
features_sheet = "features"

join_keys = ["ticker", "announce_date", "timing", "day0"]
target_col = "CAR"
ticker_filter = "GOOGL"
predictors = ["gap_proxy_dm1_to_d0", "pre_vol_5d", "eps_surprise_pct"]

# ----- Load -----
events = pd.read_excel(event_file, sheet_name=event_sheet)
features = pd.read_excel(features_file, sheet_name=features_sheet)

# sanity checks
need_events = set(join_keys + [target_col])
need_features = set(join_keys + predictors)
miss_events = [c for c in need_events if c not in events.columns]
miss_features = [c for c in need_features if c not in features.columns]
if miss_events:
    raise ValueError(f"Event sheet is missing: {miss_events}")
if miss_features:
    raise ValueError(f"Features sheet is missing: {miss_features}")

# ----- Merge and filter -----
merged = (
    pd.merge(events[join_keys + [target_col]],
             features[join_keys + predictors],
             on=join_keys, how="inner")
      .query("ticker == @ticker_filter")
      .dropna(subset=[target_col] + predictors)
      .copy()
)

print(f"Rows for {ticker_filter} in {event_sheet}: {len(merged)}")
if len(merged) < 10:
    print("Warning: very few rows. Treat the result as weak.")

# ----- Design matrices -----
X = sm.add_constant(merged[predictors])  # adds the intercept
y = merged[target_col]

# ----- Fit models -----
ols = sm.OLS(y, X).fit()
ols_hc3 = sm.OLS(y, X).fit(cov_type="HC3")

# ----- Tidy coefficient tables -----
def tidy(res):
    return pd.DataFrame({
        "term": res.params.index,
        "coefficient": res.params.values,
        "standard_error": res.bse.values,
        "t_value": res.tvalues.values,
        "p_value": res.pvalues.values
    })

coefs_ordinary = tidy(ols)
coefs_robust = tidy(ols_hc3)

# ----- Model metrics -----
metrics = pd.DataFrame([{
    "r_squared": float(ols.rsquared),
    "r_squared_adjusted": float(ols.rsquared_adj),
    "observations": int(ols.nobs)
}])

# ----- Variance inflation factors -----
X_for_vif = np.column_stack([np.ones(len(merged))] + [merged[c].to_numpy() for c in predictors])
vif_rows = []
for i, name in enumerate(["const"] + predictors):
    try:
        vif_val = variance_inflation_factor(X_for_vif, i)
    except Exception:
        vif_val = np.nan
    vif_rows.append({"term": name, "variance_inflation_factor": float(vif_val)})
vif_table = pd.DataFrame(vif_rows)

# ----- Show tables -----
print("\nCoefficients (ordinary errors):")
display(coefs_ordinary)

print("\nCoefficients (robust errors, HC3):")
display(coefs_robust)

print("\nModel metrics:")
display(metrics)

print("\nVariance inflation factors:")
display(vif_table)

# ----- Plots inside the notebook -----
# 1) Observed versus predicted
y_hat = ols.fittedvalues.to_numpy()
plt.figure(figsize=(7,5))
plt.scatter(y_hat, y)
# best fit line between predicted and observed
X_line = sm.add_constant(y_hat)
line_model = sm.OLS(y, X_line).fit()
a2, b2 = line_model.params
xx = np.linspace(y_hat.min(), y_hat.max(), 100)
yy = a2 + b2 * xx
plt.plot(xx, yy)
# 45-degree reference
plt.plot(xx, xx, linestyle="--")
plt.xlabel("Predicted CAR")
plt.ylabel("Observed CAR")
plt.title(f"{ticker_filter} — {event_sheet}\nObserved versus Predicted   R squared: {ols.rsquared:.3f}")
plt.tight_layout()
plt.show()

# 2) Residuals versus fitted
resid = ols.resid.to_numpy()
plt.figure(figsize=(7,5))
plt.scatter(y_hat, resid)
plt.axhline(0.0, linestyle="--")
plt.xlabel("Predicted CAR")
plt.ylabel("Residual")
plt.title(f"{ticker_filter} — {event_sheet}\nResiduals versus Predicted")
plt.tight_layout()
plt.show()

# 3) Quantile–quantile plot of residuals
sm.qqplot(resid, line="45")
plt.title(f"{ticker_filter} — {event_sheet}\nResiduals quantile–quantile")
plt.tight_layout()
plt.show()
Rows for GOOGL in CAR_(0,1): 43

Coefficients (ordinary errors):
term coefficient standard_error t_value p_value
0 const -0.013904 0.007469 -1.861613 7.020632e-02
1 gap_proxy_dm1_to_d0 0.839998 0.060735 13.830488 1.268989e-16
2 pre_vol_5d 0.315221 0.434983 0.724674 4.729774e-01
3 eps_surprise_pct 0.000550 0.017200 0.031958 9.746682e-01
Coefficients (robust errors, HC3):
term coefficient standard_error t_value p_value
0 const -0.013904 0.007046 -1.973195 4.847331e-02
1 gap_proxy_dm1_to_d0 0.839998 0.075371 11.144780 7.593171e-29
2 pre_vol_5d 0.315221 0.460504 0.684513 4.936511e-01
3 eps_surprise_pct 0.000550 0.013306 0.041310 9.670484e-01
Model metrics:
r_squared r_squared_adjusted observations
0 0.840422 0.828146 43
Variance inflation factors:
term variance_inflation_factor
0 const 5.130352
1 gap_proxy_dm1_to_d0 1.061984
2 pre_vol_5d 1.088619
3 eps_surprise_pct 1.150988
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [70]:
# Scatter plots with best fit lines for GOOGL in window CAR_(0,1)
# One plot per driver and one full model plot

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# ---- Settings you can change ----
event_file = "event_study.xlsx"
features_file = "features_patched.xlsx"
event_sheet = "CAR_(0,1)"
features_sheet = "features"
ticker = "GOOGL"
join_keys = ["ticker", "announce_date", "timing", "day0"]
target_col = "CAR"
predictors = ["gap_proxy_dm1_to_d0", "pre_vol_5d", "eps_surprise_pct"]

# ---- Load and merge ----
events = pd.read_excel(event_file, sheet_name=event_sheet)
features = pd.read_excel(features_file, sheet_name=features_sheet)

data = (
    pd.merge(
        events[join_keys + [target_col]],
        features[join_keys + predictors],
        on=join_keys,
        how="inner",
    )
    .query("ticker == @ticker")
    .dropna(subset=[target_col] + predictors)
    .copy()
)

if data.empty:
    raise ValueError("No rows after merge and filter. Check the ticker, window, or column names.")

# ---- Helper: scatter with best fit line y = a + b x ----
def scatter_with_fit(df, x_name, y_name, title_text):
    x = df[x_name].to_numpy()
    y = df[y_name].to_numpy()

    X = sm.add_constant(x)
    model = sm.OLS(y, X).fit()
    a, b = model.params  # intercept, slope

    x_line = np.linspace(x.min(), x.max(), 100)
    y_line = a + b * x_line

    plt.figure(figsize=(7, 5))
    plt.scatter(x, y)
    plt.plot(x_line, y_line)
    plt.xlabel(x_name)
    plt.ylabel(y_name)
    plt.title(f"{title_text}\nSlope: {b:.4g}   Intercept: {a:.4g}   R squared: {model.rsquared:.3f}")
    plt.grid(True, linestyle="--", alpha=0.4)
    plt.tight_layout()
    plt.show()

# ---- One plot per driver ----
for xcol in predictors:
    scatter_with_fit(data, xcol, target_col, f"{ticker} — {event_sheet}")

# ---- Full model: observed versus predicted ----
X_full = sm.add_constant(data[predictors])
y_full = data[target_col].to_numpy()
mlr = sm.OLS(y_full, X_full).fit()
y_hat = mlr.fittedvalues.to_numpy()

# Best fit line between predicted and observed (not forced to forty five degrees)
X_line = sm.add_constant(y_hat)
line_model = sm.OLS(y_full, X_line).fit()
a2, b2 = line_model.params
xx = np.linspace(y_hat.min(), y_hat.max(), 100)
yy = a2 + b2 * xx

plt.figure(figsize=(7, 5))
plt.scatter(y_hat, y_full)
plt.plot(xx, yy)
plt.plot(xx, xx, linestyle="--")  # forty five degree reference
plt.xlabel("Predicted CAR")
plt.ylabel("Observed CAR")
plt.title(f"{ticker} — {event_sheet}\nObserved versus Predicted   R squared: {mlr.rsquared:.3f}")
plt.grid(True, linestyle="--", alpha=0.4)
plt.tight_layout()
plt.show()
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [ ]:
 
In [1]:
# === Setup ===
from pathlib import Path
import re
import numpy as np
import pandas as pd

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# === Paths ===
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
    Path("."),          # notebook folder
    Path("/mnt/data"),  # uploaded files fallback
]
FEATURE_FILES = ["features v1.xlsx", "features v2.xlsx", "features v3.xlsx"]
EVENT_FILE = "event_study.xlsx"

# Optional manual overrides if names are odd
MANUAL_COLNAMES = {
    # "features v1.xlsx": {"day0": "day0", "ticker": "ticker"},
    # "features v2.xlsx": {"day0": "day0", "ticker": "ticker"},
    # "features v3.xlsx": {"day0": "day0", "ticker": "ticker"},
    # "event_study.xlsx": {"day0": "day0", "ticker": "ticker"},
}

# === Helpers ===
def find_file(filename):
    for base in BASE_DIRS:
        p = base / filename
        if p.exists():
            return p
    return None

def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    """Skip readme sheets. Pick the sheet with the most numeric columns, then most rows."""
    candidates = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not candidates:
        return max(book, key=lambda n: len(book[n]))
    def score(item):
        n, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    name, _ = max(candidates, key=score)
    return name

def find_event_window_sheets(book: dict):
    """Map each window to its sheet by name pattern."""
    sheet_map = {"0,1": None, "0,3": None, "0,5": None}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
    }
    for name in book.keys():
        if is_readme_sheet(name):
            continue
        for w, pat in pats.items():
            if sheet_map[w] is None and pat.search(str(name)):
                sheet_map[w] = name
    return sheet_map

def find_day0_column(df: pd.DataFrame) -> str | None:
    cols = [str(c) for c in df.columns]
    strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
    if strict:
        return strict[0]
    fallbacks = [
        "event_date","EventDate","EVENT_DATE","eventDate",
        "announcement_date","AnnouncementDate","ANNOUNCEMENT_DATE","ann_date","AnnDate",
        "date","Date","DATE","trading_date","TradingDate",
        "day0date","date0","Date0","DATE0"
    ]
    for name in fallbacks:
        if name in df.columns:
            return name
    best, best_nonnull = None, -1
    for c in df.columns:
        s = pd.to_datetime(df[c], errors="coerce")
        nonnull = int(s.notna().sum())
        if nonnull > best_nonnull:
            best, best_nonnull = c, nonnull
    return best if best_nonnull > 0 else None

def find_ticker_column(df: pd.DataFrame) -> str | None:
    tickers = [
        "ticker","Ticker","symbol","Symbol","ric","RIC","permno","PERMNO",
        "isin","ISIN","cusip","CUSIP","sedol","SEDOL"
    ]
    for name in tickers:
        if name in df.columns:
            return name
    # last resort: pick a non-numeric column with many unique short codes
    obj_cols = df.select_dtypes(include=["object"]).columns
    best, best_score = None, -1
    for c in obj_cols:
        s = df[c].astype(str).str.strip()
        uniq = s.nunique()
        avg_len = s.str.len().mean()
        score = uniq - 0.1*avg_len
        if uniq > 50 and score > best_score:
            best, best_score = c, score
    return best

def normalize_day0(s: pd.Series) -> pd.Series:
    # pick the parse that yields more valid dates
    d1 = pd.to_datetime(s, errors="coerce").dt.normalize()
    d2 = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    use = d2.where(d2.notna(), d1)
    return use

def normalize_ticker(s: pd.Series) -> pd.Series:
    return s.astype(str).str.strip().str.upper()

def coerce_numeric(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for c in out.columns:
        if out[c].dtype == "object":
            try:
                out[c] = pd.to_numeric(out[c], errors="raise")
            except Exception:
                pass
    return out

def find_target_column_event(df: pd.DataFrame) -> str | None:
    cols = list(df.columns)
    pri = [c for c in cols if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
    if pri:
        return pri[0]
    sec = [c for c in cols if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
    if sec:
        return sec[0]
    # last resort: the only numeric column left besides keys
    return None

def aggregate_features_by_keys(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str) -> pd.DataFrame:
    """
    One row per [day0, ticker].
    Aggregate numeric predictors by mean.
    Keep the key columns.
    """
    df = df_feat_raw.copy()
    df["__day0__"] = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()

    grouped = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
    return grouped, num_cols  # return numeric predictors list for later filtering

def build_X_from_features_only(merged: pd.DataFrame, numeric_feature_cols: list, target_col: str) -> pd.DataFrame:
    """
    Use numeric predictors that came from the features sheet.
    Drop the target if it shares a name.
    Drop zero variance columns.
    """
    keep_cols = [c for c in numeric_feature_cols if c in merged.columns]
    X = merged.loc[:, keep_cols].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    nunique = X.nunique(dropna=False)
    X = X.loc[:, nunique > 1]
    return X

def fit_and_score(X: pd.DataFrame, y: pd.Series, k_max=5, random_state=42):
    data = pd.concat([y, X], axis=1).dropna()
    y_clean = data.iloc[:, 0]
    X_clean = data.iloc[:, 1:]

    n_rows = len(y_clean)
    n_feat = X_clean.shape[1]

    if n_feat == 0 or n_rows < max(10, n_feat + 2):
        return {"rows_used": int(n_rows), "features_used": int(n_feat),
                "r_squared": np.nan, "adjusted_r_squared": np.nan, "cross_validated_r_squared": np.nan}

    model = LinearRegression()
    model.fit(X_clean.values, y_clean.values)

    r2 = float(model.score(X_clean.values, y_clean.values))
    n = float(n_rows)
    p = float(n_feat)
    adj = 1.0 - (1.0 - r2) * (n - 1.0) / (n - p - 1.0) if (n - p - 1.0) > 0 else np.nan

    splits = min(k_max, n_rows)
    if splits < 3:
        cv_r2 = np.nan
    else:
        kf = KFold(n_splits=splits if splits <= 5 else 5, shuffle=True, random_state=random_state)
        cv_scores = cross_val_score(LinearRegression(), X_clean.values, y_clean.values, cv=kf, scoring="r2")
        cv_r2 = float(np.nanmean(cv_scores))

    return {"rows_used": int(n_rows), "features_used": int(n_feat),
            "r_squared": r2, "adjusted_r_squared": adj, "cross_validated_r_squared": cv_r2}

# === Load event study and map windows ===
evt_path = find_file(EVENT_FILE)
if evt_path is None:
    raise FileNotFoundError(f"Could not find {EVENT_FILE}")

evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
window_to_sheet = find_event_window_sheets(evt_book)
if not any(window_to_sheet.values()):
    raise ValueError("Could not detect event study sheets for 0,1 0,3 0,5.")

# === Main loop with join on [day0, ticker] ===
merge_log = []
rows = []

for feat_name in FEATURE_FILES:
    fpath = find_file(feat_name)
    if fpath is None:
        print(f"Warning: {feat_name} not found. Skipping.")
        continue

    feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
    feat_sheet = choose_features_sheet(feat_book)
    df_feat_raw = feat_book[feat_sheet].copy()

    # resolve column names
    day0_feat = MANUAL_COLNAMES.get(feat_name, {}).get("day0") or find_day0_column(df_feat_raw)
    ticker_feat = MANUAL_COLNAMES.get(feat_name, {}).get("ticker") or find_ticker_column(df_feat_raw)
    if day0_feat is None or ticker_feat is None:
        print(f"\n{feat_name}: could not find day0 or ticker. Found day0={day0_feat}, ticker={ticker_feat}.")
        continue

    # aggregate to one row per key
    feat_agg, numeric_cols = aggregate_features_by_keys(df_feat_raw, day0_feat, ticker_feat)

    for window_key in ["0,1", "0,3", "0,5"]:
        evt_sheet = window_to_sheet.get(window_key)
        if evt_sheet is None:
            print(f"{feat_name} | window {window_key}: no event sheet.")
            continue
        df_evt = evt_book[evt_sheet].copy()

        # resolve event columns
        day0_evt = MANUAL_COLNAMES.get(EVENT_FILE, {}).get("day0") or find_day0_column(df_evt)
        ticker_evt = MANUAL_COLNAMES.get(EVENT_FILE, {}).get("ticker") or find_ticker_column(df_evt)
        y_col = find_target_column_event(df_evt)

        if day0_evt is None or ticker_evt is None or y_col is None:
            print(f"{feat_name} | window {window_key}: cannot resolve columns. day0={day0_evt}, ticker={ticker_evt}, target={y_col}")
            continue

        # normalise and dedupe event by keys
        evt_clean = df_evt.copy()
        evt_clean["__day0__"] = normalize_day0(evt_clean[day0_evt])
        evt_clean["__ticker__"] = normalize_ticker(evt_clean[ticker_evt])
        evt_targets = evt_clean[["__day0__","__ticker__", y_col]].dropna(subset=["__day0__","__ticker__", y_col])
        # drop duplicates on keys, keep first target for that key
        dup_evt = int(evt_targets.duplicated(subset=["__day0__","__ticker__"]).sum())
        evt_targets = evt_targets.drop_duplicates(subset=["__day0__","__ticker__"], keep="first")

        # join on both keys
        merged = feat_agg.merge(evt_targets, on=["__day0__","__ticker__"], how="inner")
        merged_rows = len(merged)
        missing_predictors_rows = int(pd.concat([merged[[y_col]], merged[numeric_cols]], axis=1).isna().any(axis=1).sum())

        # build predictors
        X = build_X_from_features_only(merged, numeric_cols, target_col=y_col)
        if X.shape[1] == 0 or merged_rows == 0:
            print(f"{feat_name} | window {window_key}: zero predictors or zero rows after merge.")
            continue

        y = merged[y_col]
        metrics = fit_and_score(X, y)

        merge_log.append({
            "features_file": feat_name,
            "features_sheet": feat_sheet,
            "event_sheet": evt_sheet,
            "window": window_key,
            "day0_features_col": day0_feat,
            "ticker_features_col": ticker_feat,
            "day0_event_col": day0_evt,
            "ticker_event_col": ticker_evt,
            "duplicates_in_event_for_keys": dup_evt,
            "rows_in_features_after_groupby": len(feat_agg),
            "rows_after_merge": merged_rows,
            "rows_dropped_due_to_missing_predictors_or_target": missing_predictors_rows,
            "predictors_used": metrics["features_used"],
            "target_col": y_col,
        })

        rows.append({
            "features_file": feat_name,
            "features_sheet": feat_sheet,
            "event_sheet": evt_sheet,
            "window": window_key,
            "rows_used": metrics["rows_used"],
            "features_used": metrics["features_used"],
            "r_squared": metrics["r_squared"],
            "adjusted_r_squared": metrics["adjusted_r_squared"],
            "cross_validated_r_squared": metrics["cross_validated_r_squared"],
        })

# === Show results ===
from IPython.display import display
pd.set_option("display.max_columns", None)

log_df = pd.DataFrame(merge_log)
res_df = pd.DataFrame(rows)

if not log_df.empty:
    print("\nMerge audit (joined on day0 + ticker, features deduplicated by mean within keys):")
    display(log_df)

if res_df.empty:
    print("\nNo models were fit. Set MANUAL_COLNAMES at the top if the column names are unusual.")
else:
    order = {"0,1": 0, "0,3": 1, "0,5": 2}
    res_df["window_order"] = res_df["window"].map(order).fillna(99)
    res_df = res_df.sort_values(["window_order", "features_file"]).drop(columns=["window_order"])

    print("\nDetailed results (one row per features set and window):")
    display(res_df.reset_index(drop=True))

    print("\nComparison table (rows are windows, columns are metrics per features set):")
    wide = res_df.pivot_table(index=["window"],
                              columns="features_file",
                              values=["r_squared", "adjusted_r_squared", "cross_validated_r_squared"],
                              aggfunc="first")
    display(wide)

    print("\nBest by adjusted coefficient of determination within each window:")
    for w in ["0,1", "0,3", "0,5"]:
        block = res_df[res_df["window"] == w]
        if not block.empty:
            top = block.sort_values("adjusted_r_squared", ascending=False).iloc[0]
            print(f"  Window {w}: {top['features_file']}  adjusted={top['adjusted_r_squared']:.4f}  cross_validated={top['cross_validated_r_squared']:.4f}")
Merge audit (joined on day0 + ticker, features deduplicated by mean within keys):
features_file features_sheet event_sheet window day0_features_col ticker_features_col day0_event_col ticker_event_col duplicates_in_event_for_keys rows_in_features_after_groupby rows_after_merge rows_dropped_due_to_missing_predictors_or_target predictors_used target_col
0 features v1.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 0 129 129 0 16 CAR
1 features v1.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 0 129 129 0 16 CAR
2 features v1.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 0 129 129 0 16 CAR
3 features v2.xlsx data CAR_(0,1) 0,1 day0 ticker day0 ticker 0 129 129 0 24 CAR
4 features v2.xlsx data CAR_(0,3) 0,3 day0 ticker day0 ticker 0 129 129 0 24 CAR
5 features v2.xlsx data CAR_(0,5) 0,5 day0 ticker day0 ticker 0 129 129 0 24 CAR
6 features v3.xlsx data CAR_(0,1) 0,1 day0 ticker day0 ticker 0 129 129 0 41 CAR
7 features v3.xlsx data CAR_(0,3) 0,3 day0 ticker day0 ticker 0 129 129 0 41 CAR
8 features v3.xlsx data CAR_(0,5) 0,5 day0 ticker day0 ticker 0 129 129 0 41 CAR
Detailed results (one row per features set and window):
features_file features_sheet event_sheet window rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared
0 features v1.xlsx features CAR_(0,1) 0,1 129 16 0.303485 0.203983 -0.053040
1 features v2.xlsx data CAR_(0,1) 0,1 129 24 0.335126 0.181693 -0.135035
2 features v3.xlsx data CAR_(0,1) 0,1 129 41 0.452112 0.193912 -0.347043
3 features v1.xlsx features CAR_(0,3) 0,3 129 16 0.250824 0.143799 -0.147430
4 features v2.xlsx data CAR_(0,3) 0,3 129 24 0.272953 0.105174 -0.263206
5 features v3.xlsx data CAR_(0,3) 0,3 129 41 0.403288 0.122078 -0.516824
6 features v1.xlsx features CAR_(0,5) 0,5 129 16 0.257400 0.151314 -0.108714
7 features v2.xlsx data CAR_(0,5) 0,5 129 24 0.273454 0.105789 -0.202303
8 features v3.xlsx data CAR_(0,5) 0,5 129 41 0.414750 0.138942 -0.429681
Comparison table (rows are windows, columns are metrics per features set):
adjusted_r_squared cross_validated_r_squared r_squared
features_file features v1.xlsx features v2.xlsx features v3.xlsx features v1.xlsx features v2.xlsx features v3.xlsx features v1.xlsx features v2.xlsx features v3.xlsx
window
0,1 0.203983 0.181693 0.193912 -0.053040 -0.135035 -0.347043 0.303485 0.335126 0.452112
0,3 0.143799 0.105174 0.122078 -0.147430 -0.263206 -0.516824 0.250824 0.272953 0.403288
0,5 0.151314 0.105789 0.138942 -0.108714 -0.202303 -0.429681 0.257400 0.273454 0.414750
Best by adjusted coefficient of determination within each window:
  Window 0,1: features v1.xlsx  adjusted=0.2040  cross_validated=-0.0530
  Window 0,3: features v1.xlsx  adjusted=0.1438  cross_validated=-0.1474
  Window 0,5: features v1.xlsx  adjusted=0.1513  cross_validated=-0.1087
In [2]:
# --- MLR visualisations: join on [day0 + ticker], features-only predictors ---

# If needed first run:
# !pip install pandas numpy scikit-learn matplotlib openpyxl

from pathlib import Path
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, cross_val_predict

# ====== CONFIG ======
DATA_DIR = Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data")
EVENT_FILE = DATA_DIR / "event_study.xlsx"
FEATURE_FILES = [
    DATA_DIR / "features v1.xlsx",
    DATA_DIR / "features v2.xlsx",
    DATA_DIR / "features v3.xlsx",
]
WINDOWS = ["0,1","0,3","0,5"]  # CAR windows to use
BEST_SET_FOR_SCATTERS = "features v1.xlsx"  # pick which features set to show in the scatter plots
SAVE_FIGS = False  # set True if you want PNGs saved next to this notebook

# ====== HELPERS ======
def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))

def choose_features_sheet(book: dict) -> str:
    # pick the non-readme sheet with most numeric cols, then most rows
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands: 
        return next(iter(book))
    def score(item):
        n, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_day0_column(df: pd.DataFrame) -> str | None:
    cols = [str(c) for c in df.columns]
    strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, re.I)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    best, kbest = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > kbest: best, kbest = c, k
    return best

def find_ticker_column(df: pd.DataFrame) -> str | None:
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    # fallback: likely code column
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score: best, score = c, sc
    return best

def normalize_day0(s: pd.Series) -> pd.Series:
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series) -> pd.Series:
    return s.astype(str).str.strip().str.upper()

def find_event_window_sheets(book: dict):
    out = {"0,1": None, "0,3": None, "0,5": None}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.I),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.I),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.I),
    }
    for name in book:
        if is_readme_sheet(name): 
            continue
        for w, pat in pats.items():
            if out[w] is None and pat.search(str(name)): 
                out[w] = name
    return out

def find_target_column_event(df: pd.DataFrame) -> str | None:
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
    return c2[0] if c2 else None

def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
    df = df_feat_raw.copy()
    df["__day0__"]   = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
    g = g.dropna(subset=["__day0__","__ticker__"])
    return g, num_cols

def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    # drop zero-variance
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def metrics_and_predictions(X: pd.DataFrame, y: pd.Series):
    data = pd.concat([y, X], axis=1).dropna()
    y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
    n, p = len(y_c), X_c.shape[1]
    if p == 0 or n < max(10, p+2):
        return {"rows": int(n), "p": int(p), "r2": np.nan, "adj": np.nan, "cv": np.nan,
                "y": pd.Series(dtype=float), "yhat_cv": pd.Series(dtype=float)}
    model = LinearRegression().fit(X_c.values, y_c.values)
    r2 = float(model.score(X_c.values, y_c.values))
    adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
    kf = KFold(n_splits=min(5, n), shuffle=True, random_state=42)
    cv = float(np.mean(cross_val_score(LinearRegression(), X_c.values, y_c.values, cv=kf, scoring="r2")))
    yhat_cv = pd.Series(cross_val_predict(LinearRegression(), X_c.values, y_c.values, cv=kf), index=y_c.index)
    return {"rows": int(n), "p": int(p), "r2": r2, "adj": adj, "cv": cv, "y": y_c, "yhat_cv": yhat_cv}

# ====== LOAD ======
evt_book = pd.read_excel(EVENT_FILE, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)

features_data = {}
for fpath in FEATURE_FILES:
    book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
    fsheet = choose_features_sheet(book)
    raw = book[fsheet].copy()
    dcol = find_day0_column(raw)
    tcol = find_ticker_column(raw)
    grouped, num_cols = aggregate_features(raw, dcol, tcol)
    features_data[fpath.name] = {"grouped": grouped, "num_cols": num_cols, "sheet": fsheet, "day0": dcol, "ticker": tcol}

# ====== METRICS ======
rows = []
preds = {}  # (features, window) -> (y, yhat_cv)

for w in WINDOWS:
    esheet = win_map.get(w)
    if esheet is None:
        raise ValueError(f"Could not find event sheet for window {w}.")
    df_evt = evt_book[esheet].copy()
    d0_evt = find_day0_column(df_evt)
    tk_evt = find_ticker_column(df_evt)
    ycol = find_target_column_event(df_evt)
    evt = df_evt.copy()
    evt["__day0__"]   = normalize_day0(evt[d0_evt])
    evt["__ticker__"] = normalize_ticker(evt[tk_evt])
    evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])

    for fname, pack in features_data.items():
        merged = pack["grouped"].merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
        X = build_X(merged, pack["num_cols"], ycol)
        y = merged[ycol]
        m = metrics_and_predictions(X, y)
        rows.append({
            "features_file": fname, "window": w,
            "rows_used": m["rows"], "features_used": m["p"],
            "r_squared": m["r2"], "adjusted_r_squared": m["adj"], "cross_validated_r_squared": m["cv"]
        })
        preds[(fname, w)] = (m["y"], m["yhat_cv"])

metrics = pd.DataFrame(rows).sort_values(["window","features_file"]).reset_index(drop=True)
display(metrics)

# Save metrics (optional)
out_csv = DATA_DIR / "mlr_metrics_by_features_and_window.csv"
metrics.to_csv(out_csv, index=False)
print(f"Saved metrics to: {out_csv}")

# ====== PLOTS ======
# 1) Bars: cross-validated R^2 by features set, for each window
for w in WINDOWS:
    sub = metrics[metrics["window"] == w]
    if sub.empty: 
        continue
    plt.figure()
    plt.bar(sub["features_file"], sub["cross_validated_r_squared"])
    plt.title(f"Cross Validated R Squared by Features Set (Window {w})")
    plt.xlabel("Features Set")
    plt.ylabel("Cross Validated R Squared")
    plt.xticks(rotation=30, ha="right")
    plt.tight_layout()
    if SAVE_FIGS:
        plt.savefig(DATA_DIR / f"cv_r2_bar_window_{w.replace(',','_')}.png", dpi=160)
    plt.show()

# 2) Lines: adjusted R^2, cross-validated R^2, and R^2 across windows
for metric in ["adjusted_r_squared", "cross_validated_r_squared", "r_squared"]:
    plt.figure()
    for fname in features_data.keys():
        xs, ys = [], []
        for w in WINDOWS:
            row = metrics[(metrics["features_file"] == fname) & (metrics["window"] == w)]
            if not row.empty:
                xs.append(w)
                ys.append(float(row.iloc[0][metric]))
        if ys:
            plt.plot(xs, ys, marker="o", label=fname)
    plt.title(metric.replace("_"," ").title() + " Across Windows")
    plt.xlabel("Window")
    plt.ylabel(metric.replace("_"," ").title())
    plt.legend()
    plt.tight_layout()
    if SAVE_FIGS:
        plt.savefig(DATA_DIR / f"{metric}_across_windows.png", dpi=160)
    plt.show()

# 3) Scatters: out-of-sample predictions vs actual, for the chosen features set, per window
for w in WINDOWS:
    y, yhat = preds.get((BEST_SET_FOR_SCATTERS, w), (pd.Series(dtype=float), pd.Series(dtype=float)))
    if y.empty: 
        continue
    plt.figure()
    plt.scatter(y, yhat, alpha=0.7)
    # 45-degree line
    mn = float(min(y.min(), yhat.min()))
    mx = float(max(y.max(), yhat.max()))
    plt.plot([mn, mx], [mn, mx])
    plt.title(f"Out-of-sample Predictions vs Actual (Window {w}) — {BEST_SET_FOR_SCATTERS}")
    plt.xlabel("Actual CAR")
    plt.ylabel("Predicted CAR (Cross Validated)")
    plt.tight_layout()
    if SAVE_FIGS:
        plt.savefig(DATA_DIR / f"scatter_cv_{BEST_SET_FOR_SCATTERS.replace(' ','_')}_{w.replace(',','_')}.png", dpi=160)
    plt.show()

# 4) Bars: adjusted R^2 vs number of predictors (per window)
for w in WINDOWS:
    sub = metrics[metrics["window"] == w].copy()
    if sub.empty: 
        continue
    plt.figure()
    plt.bar(sub["features_used"].astype(int).astype(str), sub["adjusted_r_squared"])
    plt.title(f"Adjusted R Squared by Number of Predictors (Window {w})")
    plt.xlabel("Number of Predictors")
    plt.ylabel("Adjusted R Squared")
    plt.tight_layout()
    if SAVE_FIGS:
        plt.savefig(DATA_DIR / f"adj_r2_vs_nfeatures_{w.replace(',','_')}.png", dpi=160)
    plt.show()
features_file window rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared
0 features v1.xlsx 0,1 129 16 0.303485 0.203983 -0.053040
1 features v2.xlsx 0,1 129 24 0.335126 0.181693 -0.135035
2 features v3.xlsx 0,1 129 41 0.452112 0.193912 -0.347043
3 features v1.xlsx 0,3 129 16 0.250824 0.143799 -0.147430
4 features v2.xlsx 0,3 129 24 0.272953 0.105174 -0.263206
5 features v3.xlsx 0,3 129 41 0.403288 0.122078 -0.516824
6 features v1.xlsx 0,5 129 16 0.257400 0.151314 -0.108714
7 features v2.xlsx 0,5 129 24 0.273454 0.105789 -0.202303
8 features v3.xlsx 0,5 129 41 0.414750 0.138942 -0.429681
Saved metrics to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data\mlr_metrics_by_features_and_window.csv
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
In [3]:
# === Feature pruning with grouped cross validation (ticker-aware) ===
# Join on [day0 + ticker]. Use features-only predictors. No columns from event study besides target.
# Outputs:
#   - Baseline cross validated coefficient of determination for each features set and window
#   - Leave-one-feature-out deltas (how much each feature helps or hurts)
#   - Suggested drop list (features that hurt)
#   - New score after dropping suggested features
#   - Lasso stability selection frequency (how often a feature survives lasso across folds)
#
# If needed first run:
# !pip install pandas numpy scikit-learn openpyxl

from pathlib import Path
import re
import numpy as np
import pandas as pd

from sklearn.model_selection import GroupKFold, KFold
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# ====== CONFIG ======
DATA_DIR = Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data")
EVENT_FILE = DATA_DIR / "event_study.xlsx"
FEATURE_FILES = [
    DATA_DIR / "features v1.xlsx",
    DATA_DIR / "features v2.xlsx",
    DATA_DIR / "features v3.xlsx",
]
WINDOWS = ["0,1","0,3","0,5"]
NEGATIVE_DELTA_THRESHOLD = 0.005   # drop a feature if removing it improves cross validated coefficient of determination by at least this much
MAX_FOLDS = 5                      # up to five folds for grouped cross validation
RANDOM_STATE = 42

# ====== HELPERS ======
def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands:
        return next(iter(book))
    def score(item):
        n, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_day0_column(df: pd.DataFrame) -> str | None:
    cols = [str(c) for c in df.columns]
    strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","EVENT_DATE","eventDate",
              "announcement_date","AnnouncementDate","ANNOUNCEMENT_DATE","ann_date","AnnDate",
              "date","Date","DATE","trading_date","TradingDate",
              "day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    # fallback: most date-like
    best, best_nonnull = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > best_nonnull:
            best, best_nonnull = c, k
    return best

def find_ticker_column(df: pd.DataFrame) -> str | None:
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns:
            return c
    # fallback: likely code column
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score:
            best, score = c, sc
    return best

def normalize_day0(s: pd.Series) -> pd.Series:
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series) -> pd.Series:
    return s.astype(str).str.strip().str.upper()

def find_event_window_sheets(book: dict):
    out = {"0,1": None, "0,3": None, "0,5": None}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
    }
    for name in book:
        if is_readme_sheet(name): 
            continue
        for w, pat in pats.items():
            if out[w] is None and pat.search(str(name)):
                out[w] = name
    return out

def find_target_column_event(df: pd.DataFrame) -> str | None:
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
    return c2[0] if c2 else None

def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
    df = df_feat_raw.copy()
    df["__day0__"]   = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
    g = g.dropna(subset=["__day0__","__ticker__"])
    return g, num_cols

def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    # drop zero-variance
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def grouped_cv_r2(model, X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
    n_groups = int(groups.nunique())
    n_splits = max(3, min(max_folds, n_groups))  # at least three splits
    gkf = GroupKFold(n_splits=n_splits)
    scores = []
    for tr, te in gkf.split(X, y, groups=groups):
        model.fit(X.iloc[tr].values, y.iloc[tr].values)
        r2 = model.score(X.iloc[tr].values, y.iloc[tr].values)  # in-sample on train
        # we want out-of-sample on test:
        y_pred = model.predict(X.iloc[te].values)
        # coefficient of determination on test fold
        y_true = y.iloc[te].values
        ss_res = np.sum((y_true - y_pred)**2)
        ss_tot = np.sum((y_true - np.mean(y_true))**2)
        r2_test = 1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan
        scores.append(r2_test)
    return float(np.nanmean(scores))

def leave_one_feature_out_deltas(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
    base = grouped_cv_r2(LinearRegression(), X, y, groups, max_folds=max_folds)
    rows = []
    for col in X.columns:
        X_drop = X.drop(columns=[col])
        r2_drop = grouped_cv_r2(LinearRegression(), X_drop, y, groups, max_folds=max_folds)
        delta = base - r2_drop  # positive = feature helps; negative = feature hurts
        rows.append({"feature": col, "base_cross_validated_r_squared": base,
                     "cross_validated_r_squared_without_feature": r2_drop,
                     "delta": delta})
    out = pd.DataFrame(rows).sort_values("delta", ascending=True).reset_index(drop=True)
    return base, out

def lasso_stability_selection(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5, alphas=None):
    if alphas is None:
        alphas = np.logspace(-4, 1, 12)
    n_groups = int(groups.nunique())
    n_splits = max(3, min(max_folds, n_groups))
    gkf = GroupKFold(n_splits=n_splits)
    counts = pd.Series(0, index=X.columns, dtype=int)
    for tr, te in gkf.split(X, y, groups=groups):
        Xtr, ytr = X.iloc[tr], y.iloc[tr]
        gtr = groups.iloc[tr]
        # inner split on training only (not group-aware inside to keep it light)
        best_score, best_alpha = -1e9, None
        for a in alphas:
            pipe = Pipeline([("scaler", StandardScaler(with_mean=True, with_std=True)),
                             ("lasso", Lasso(alpha=a, max_iter=10000, random_state=RANDOM_STATE))])
            # simple inner score with ordinary k-fold on training only
            kf = KFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)
            vals = []
            for tr2, te2 in kf.split(Xtr, ytr):
                pipe.fit(Xtr.iloc[tr2].values, ytr.iloc[tr2].values)
                ypred = pipe.predict(Xtr.iloc[te2].values)
                ytrue = ytr.iloc[te2].values
                ss_res = np.sum((ytrue - ypred)**2); ss_tot = np.sum((ytrue - np.mean(ytrue))**2)
                r2 = 1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan
                vals.append(r2)
            mean_score = float(np.nanmean(vals))
            if mean_score > best_score:
                best_score, best_alpha = mean_score, a
        # fit with best alpha on full training fold and count non-zero features
        pipe = Pipeline([("scaler", StandardScaler(with_mean=True, with_std=True)),
                         ("lasso", Lasso(alpha=best_alpha, max_iter=10000, random_state=RANDOM_STATE))])
        pipe.fit(Xtr.values, ytr.values)
        coefs = pipe.named_steps["lasso"].coef_
        support = (np.abs(coefs) > 1e-12)
        counts.loc[X.columns[support]] += 1
    freq = (counts / n_splits).rename("lasso_selection_frequency").to_frame()
    return freq.sort_values("lasso_selection_frequency", ascending=False)

# ====== LOAD DATASETS ======
evt_book = pd.read_excel(EVENT_FILE, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)

def build_dataset(features_path: Path, window_key: str):
    # features
    feat_book = pd.read_excel(features_path, sheet_name=None, engine="openpyxl")
    fsheet = choose_features_sheet(feat_book)
    df_feat_raw = feat_book[fsheet].copy()
    dfeat = find_day0_column(df_feat_raw)
    tfeat = find_ticker_column(df_feat_raw)
    feat_grouped, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)

    # event
    esheet = win_map.get(window_key)
    df_evt = evt_book[esheet].copy()
    devt = find_day0_column(df_evt)
    tevt = find_ticker_column(df_evt)
    ycol = find_target_column_event(df_evt)

    evt_targets = df_evt.copy()
    evt_targets["__day0__"]   = normalize_day0(evt_targets[devt])
    evt_targets["__ticker__"] = normalize_ticker(evt_targets[tevt])
    evt_targets = evt_targets.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])

    merged = feat_grouped.merge(evt_targets[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
    groups = merged["__ticker__"]  # group by ticker for cross validation
    X = build_X(merged, num_cols, ycol)
    y = merged[ycol].astype(float)
    return X, y, groups, fsheet, ycol, dfeat, tfeat, devt, tevt, len(merged)

# ====== MAIN LOOP ======
all_summaries = []
all_lofo = []
all_stability = []
all_drop_runs = []

for features_path in FEATURE_FILES:
    for w in WINDOWS:
        X, y, groups, fsheet, ycol, dfeat, tfeat, devt, tevt, n_merged = build_dataset(features_path, w)

        if X.empty or len(y) < 10:
            print(f"Skip {features_path.name} | window {w}: not enough data.")
            continue

        # Baseline grouped cross validated coefficient of determination
        base_cv = grouped_cv_r2(LinearRegression(), X, y, groups, max_folds=MAX_FOLDS)

        # Leave-one-feature-out
        base_check, lofo = leave_one_feature_out_deltas(X, y, groups, max_folds=MAX_FOLDS)
        # The two base numbers should match
        base_cv = float(base_cv)
        lofo["features_file"] = features_path.name
        lofo["window"] = w
        all_lofo.append(lofo)

        # Suggested drops: features with negative delta below threshold (removing improves score)
        drop_list = lofo[lofo["delta"] <= -NEGATIVE_DELTA_THRESHOLD]["feature"].tolist()

        # New score after dropping suggested features
        if drop_list:
            X_pruned = X.drop(columns=drop_list)
            new_cv = grouped_cv_r2(LinearRegression(), X_pruned, y, groups, max_folds=MAX_FOLDS)
        else:
            new_cv = base_cv

        all_summaries.append({
            "features_file": features_path.name,
            "window": w,
            "rows_used": len(y),
            "features_used": X.shape[1],
            "baseline_cross_validated_r_squared": base_cv,
            "n_features_flagged_to_drop": len(drop_list),
            "new_cross_validated_r_squared_after_drop": new_cv
        })

        # Lasso stability selection (light)
        stability = lasso_stability_selection(X, y, groups, max_folds=MAX_FOLDS)
        stability["features_file"] = features_path.name
        stability["window"] = w
        all_stability.append(stability.reset_index().rename(columns={"index":"feature"}))

        # Store the actual drop list for reporting
        if drop_list:
            all_drop_runs.append(pd.DataFrame({
                "features_file": [features_path.name]*len(drop_list),
                "window": [w]*len(drop_list),
                "feature_dropped": drop_list
            }))

# ====== OUTPUT TABLES ======
summary_df = pd.DataFrame(all_summaries).sort_values(["window","features_file"]).reset_index(drop=True)
print("\n=== Summary per features set and window (grouped by ticker) ===")
display(summary_df)

# Save
summary_df.to_csv(DATA_DIR / "feature_pruning_summary.csv", index=False)

lofo_df = pd.concat(all_lofo, ignore_index=True) if all_lofo else pd.DataFrame()
if not lofo_df.empty:
    # Order with most harmful first (most negative delta)
    lofo_df = lofo_df.sort_values(["window","features_file","delta"])
    print("\n=== Leave-one-feature-out deltas (negative = harmful) ===")
    display(lofo_df)
    lofo_df.to_csv(DATA_DIR / "leave_one_feature_out_deltas.csv", index=False)

stab_df = pd.concat(all_stability, ignore_index=True) if all_stability else pd.DataFrame()
if not stab_df.empty:
    print("\n=== Lasso stability selection frequency (0 to 1) ===")
    display(stab_df.sort_values(["window","features_file","lasso_selection_frequency"], ascending=[True, True, False]).reset_index(drop=True))
    stab_df.to_csv(DATA_DIR / "lasso_stability_selection.csv", index=False)

if all_drop_runs:
    drops_df = pd.concat(all_drop_runs, ignore_index=True)
    print("\n=== Features flagged for drop by window and set ===")
    display(drops_df)
    drops_df.to_csv(DATA_DIR / "features_flagged_for_drop.csv", index=False)

print("\nFiles saved to:", DATA_DIR)
print(" - feature_pruning_summary.csv")
print(" - leave_one_feature_out_deltas.csv")
print(" - lasso_stability_selection.csv")
print(" - features_flagged_for_drop.csv")
=== Summary per features set and window (grouped by ticker) ===
features_file window rows_used features_used baseline_cross_validated_r_squared n_features_flagged_to_drop new_cross_validated_r_squared_after_drop
0 features v1.xlsx 0,1 129 16 -0.115372 10 0.118568
1 features v2.xlsx 0,1 129 24 -0.178825 11 0.032140
2 features v3.xlsx 0,1 129 41 -1.321893 25 -0.047858
3 features v1.xlsx 0,3 129 16 -0.155072 10 0.095185
4 features v2.xlsx 0,3 129 24 -0.241672 11 -0.036041
5 features v3.xlsx 0,3 129 41 -1.708581 26 -0.059018
6 features v1.xlsx 0,5 129 16 -0.089552 10 0.125117
7 features v2.xlsx 0,5 129 24 -0.193283 10 0.015720
8 features v3.xlsx 0,5 129 41 -1.425380 27 -0.010384
=== Leave-one-feature-out deltas (negative = harmful) ===
feature base_cross_validated_r_squared cross_validated_r_squared_without_feature delta features_file window
0 pre_ret_10d -0.115372 -0.040835 -7.453721e-02 features v1.xlsx 0,1
1 pre_vol_3d -0.115372 -0.054075 -6.129709e-02 features v1.xlsx 0,1
2 mkt_ret_1d_lag1 -0.115372 -0.072298 -4.307393e-02 features v1.xlsx 0,1
3 pre_vol_5d -0.115372 -0.075562 -3.980961e-02 features v1.xlsx 0,1
4 pre_ret_5d -0.115372 -0.079015 -3.635720e-02 features v1.xlsx 0,1
... ... ... ... ... ... ...
238 high_yield_option_adjusted_spread_pct -1.425380 -1.425380 4.072298e-13 features v3.xlsx 0,5
239 macro_cpi_yoy -1.425380 -1.440643 1.526227e-02 features v3.xlsx 0,5
240 vix_level_lag1 -1.425380 -1.459459 3.407871e-02 features v3.xlsx 0,5
241 pre_ret_3d -1.425380 -1.515569 9.018888e-02 features v3.xlsx 0,5
242 eps_surprise_pct -1.425380 -1.851641 4.262606e-01 features v3.xlsx 0,5

243 rows × 6 columns

=== Lasso stability selection frequency (0 to 1) ===
feature lasso_selection_frequency features_file window
0 eps_surprise_pct 1.000000 features v1.xlsx 0,1
1 pre_ret_3d 1.000000 features v1.xlsx 0,1
2 vix_chg_5d_lag1 1.000000 features v1.xlsx 0,1
3 macro_us10y 1.000000 features v1.xlsx 0,1
4 pre_vol_3d 0.666667 features v1.xlsx 0,1
... ... ... ... ...
238 quarter 0.000000 features v3.xlsx 0,5
239 vix_x_surprise 0.000000 features v3.xlsx 0,5
240 rates_x_surprise 0.000000 features v3.xlsx 0,5
241 high_rates_regime 0.000000 features v3.xlsx 0,5
242 high_vix_regime 0.000000 features v3.xlsx 0,5

243 rows × 4 columns

=== Features flagged for drop by window and set ===
features_file window feature_dropped
0 features v1.xlsx 0,1 pre_ret_10d
1 features v1.xlsx 0,1 pre_vol_3d
2 features v1.xlsx 0,1 mkt_ret_1d_lag1
3 features v1.xlsx 0,1 pre_vol_5d
4 features v1.xlsx 0,1 pre_ret_5d
... ... ... ...
135 features v3.xlsx 0,5 macro_us10y
136 features v3.xlsx 0,5 cpi_x_surprise
137 features v3.xlsx 0,5 high_vix_regime
138 features v3.xlsx 0,5 vix_chg_5d_lag1
139 features v3.xlsx 0,5 is_january

140 rows × 3 columns

Files saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - feature_pruning_summary.csv
 - leave_one_feature_out_deltas.csv
 - lasso_stability_selection.csv
 - features_flagged_for_drop.csv
In [1]:
# === Test pruned features v1.1 / v2.1 / v3.1 vs originals (join on day0 + ticker) ===
# If needed first:  pip install pandas numpy scikit-learn openpyxl matplotlib

from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

# -------- config --------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
    Path("."), Path("/mnt/data"),
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES_CANDIDATES = [
    "features v1.xlsx","features v2.xlsx","features v3.xlsx",
    "features v1.1.xlsx","features v2.1.xlsx","features v3.1.xlsx",
]
WINDOWS = ["0,1","0,3","0,5"]   # CAR windows to test
MAX_GROUP_FOLDS = 5
RANDOM_STATE = 42

# -------- helpers --------
def find_file(name):
    for b in BASE_DIRS:
        p = b / name
        if p.exists(): return p
    return None

def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands: return next(iter(book))
    def score(item):
        n, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_event_window_sheets(book: dict):
    m = {"0,1": None, "0,3": None, "0,5": None}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
    }
    for name in book.keys():
        if is_readme_sheet(name): continue
        for w, pat in pats.items():
            if m[w] is None and pat.search(str(name)): m[w] = name
    return m

def find_day0_column(df: pd.DataFrame):
    cols = [str(c) for c in df.columns]
    strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    # most date-like
    best, kbest = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > kbest: best, kbest = c, k
    return best

def find_ticker_column(df: pd.DataFrame):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    # fallback
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score:
            best, score = c, sc
    return best

def normalize_day0(s: pd.Series) -> pd.Series:
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series) -> pd.Series:
    return s.astype(str).str.strip().str.upper()

def find_target_col(df: pd.DataFrame):
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
    return c2[0] if c2 else None

def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
    df = df_feat_raw.copy()
    df["__day0__"]   = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
    g = g.dropna(subset=["__day0__","__ticker__"])
    return g, num_cols

def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
    # average test coefficient of determination across group folds
    n_groups = int(groups.nunique())
    n_splits = max(3, min(max_folds, n_groups))
    gkf = GroupKFold(n_splits=n_splits)
    model = LinearRegression()
    scores = []
    for tr, te in gkf.split(X, y, groups=groups):
        model.fit(X.iloc[tr].values, y.iloc[tr].values)
        y_pred = model.predict(X.iloc[te].values)
        y_true = y.iloc[te].values
        ss_res = np.sum((y_true - y_pred)**2)
        ss_tot = np.sum((y_true - np.mean(y_true))**2)
        r2_test = 1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan
        scores.append(r2_test)
    return float(np.nanmean(scores))

def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
    # in-sample coefficient of determination and adjusted coefficient of determination
    data = pd.concat([y, X], axis=1).dropna()
    y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
    n, p = len(y_c), X_c.shape[1]
    if p == 0 or n < max(10, p+2):
        return dict(rows_used=int(n), features_used=int(p),
                    r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
    mdl = LinearRegression().fit(X_c.values, y_c.values)
    r2 = float(mdl.score(X_c.values, y_c.values))
    adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
    cv = grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
    return dict(rows_used=int(n), features_used=int(p),
                r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)

# -------- load event workbook --------
evt_path = find_file(EVENT_FILE)
if evt_path is None:
    raise FileNotFoundError("event_study.xlsx not found in any base directory")
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)

# -------- run all available features files --------
present = [f for f in FEATURE_FILES_CANDIDATES if find_file(f) is not None]
if not present:
    raise FileNotFoundError("No features files found. Check paths.")

print("Testing these files:", present)

all_rows = []
merge_audit = []

for fname in present:
    fpath = find_file(fname)
    feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
    fsheet = choose_features_sheet(feat_book)
    df_feat_raw = feat_book[fsheet].copy()

    dfeat = find_day0_column(df_feat_raw)
    tfeat = find_ticker_column(df_feat_raw)
    feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)

    for w in WINDOWS:
        esheet = win_map.get(w)
        if esheet is None:
            print(f"Missing event sheet for window {w}. Skipping.")
            continue

        df_evt = evt_book[esheet].copy()
        devt = find_day0_column(df_evt)
        tevt = find_ticker_column(df_evt)
        ycol = find_target_col(df_evt)

        evt = df_evt.copy()
        evt["__day0__"]   = normalize_day0(evt[devt])
        evt["__ticker__"] = normalize_ticker(evt[tevt])
        evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])

        merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
        groups = merged["__ticker__"]
        X = build_X(merged, num_cols, ycol)
        y = merged[ycol].astype(float)

        # audit
        merge_audit.append({
            "features_file": fname, "features_sheet": fsheet, "window": w,
            "day0_features_col": dfeat, "ticker_features_col": tfeat,
            "day0_event_col": devt, "ticker_event_col": tevt,
            "merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
        })

        # metrics
        m = fit_and_score(X, y, groups)
        m.update(dict(features_file=fname, features_sheet=fsheet, event_sheet=esheet, window=w))
        all_rows.append(m)

# -------- results --------
audit_df = pd.DataFrame(merge_audit)
res_df = pd.DataFrame(all_rows)

pd.set_option("display.max_columns", None)
print("\nMerge audit (check keys and row counts):")
display(audit_df)

print("\nResults per features set and window (grouped by ticker):")
display(res_df.sort_values(["window","features_file"]).reset_index(drop=True))

# -------- optional: compare pruned vs original if both exist --------
def base_tag(name: str) -> str:
    # "features v1.xlsx" -> "v1", "features v1.1.xlsx" -> "v1"
    m = re.search(r"features\s+v(\d+)", name, flags=re.IGNORECASE)
    return f"v{m.group(1)}" if m else name

res_df["tag"] = res_df["features_file"].apply(base_tag)
res_df["is_pruned"] = res_df["features_file"].str.contains(r"\.1\.xlsx$", flags=re.IGNORECASE)

pairs = []
for w in WINDOWS:
    for tag in sorted(res_df["tag"].unique()):
        block = res_df[(res_df["window"] == w) & (res_df["tag"] == tag)]
        if block["is_pruned"].nunique() < 2:
            continue  # need both original and pruned
        base = block.loc[block["is_pruned"] == False].iloc[0]
        prun = block.loc[block["is_pruned"] == True].iloc[0]
        pairs.append({
            "window": w, "set": tag,
            "baseline_cross_validated_r_squared": base["cross_validated_r_squared"],
            "pruned_cross_validated_r_squared": prun["cross_validated_r_squared"],
            "delta_cross_validated_r_squared": prun["cross_validated_r_squared"] - base["cross_validated_r_squared"],
            "baseline_adjusted_r_squared": base["adjusted_r_squared"],
            "pruned_adjusted_r_squared": prun["adjusted_r_squared"],
            "delta_adjusted_r_squared": prun["adjusted_r_squared"] - base["adjusted_r_squared"],
            "baseline_r_squared": base["r_squared"],
            "pruned_r_squared": prun["r_squared"],
            "delta_r_squared": prun["r_squared"] - base["r_squared"],
            "rows_used_baseline": base["rows_used"], "rows_used_pruned": prun["rows_used"],
            "features_used_baseline": base["features_used"], "features_used_pruned": prun["features_used"],
        })

if pairs:
    comp = pd.DataFrame(pairs).sort_values(["window","set"]).reset_index(drop=True)
    print("\nBefore vs after (original vs pruned) — deltas > 0 are good:")
    display(comp)

# Save results
out_dir = find_file(EVENT_FILE).parent
res_df.to_csv(out_dir / "test_results_all_sets.csv", index=False)
if pairs:
    comp.to_csv(out_dir / "test_results_pruned_vs_original.csv", index=False)
print(f"\nSaved CSVs to: {out_dir}")
print(" - test_results_all_sets.csv")
print(" - test_results_pruned_vs_original.csv")
Testing these files: ['features v1.xlsx', 'features v2.xlsx', 'features v3.xlsx', 'features v1.1.xlsx', 'features v2.1.xlsx', 'features v3.1.xlsx']

Merge audit (check keys and row counts):
features_file features_sheet window day0_features_col ticker_features_col day0_event_col ticker_event_col merged_rows predictor_cols target_col
0 features v1.xlsx features 0,1 day0 ticker day0 ticker 129 16 CAR
1 features v1.xlsx features 0,3 day0 ticker day0 ticker 129 16 CAR
2 features v1.xlsx features 0,5 day0 ticker day0 ticker 129 16 CAR
3 features v2.xlsx data 0,1 day0 ticker day0 ticker 129 24 CAR
4 features v2.xlsx data 0,3 day0 ticker day0 ticker 129 24 CAR
5 features v2.xlsx data 0,5 day0 ticker day0 ticker 129 24 CAR
6 features v3.xlsx data 0,1 day0 ticker day0 ticker 129 41 CAR
7 features v3.xlsx data 0,3 day0 ticker day0 ticker 129 41 CAR
8 features v3.xlsx data 0,5 day0 ticker day0 ticker 129 41 CAR
9 features v1.1.xlsx features 0,1 day0 ticker day0 ticker 129 14 CAR
10 features v1.1.xlsx features 0,3 day0 ticker day0 ticker 129 14 CAR
11 features v1.1.xlsx features 0,5 day0 ticker day0 ticker 129 14 CAR
12 features v2.1.xlsx data 0,1 day0 ticker day0 ticker 129 20 CAR
13 features v2.1.xlsx data 0,3 day0 ticker day0 ticker 129 20 CAR
14 features v2.1.xlsx data 0,5 day0 ticker day0 ticker 129 20 CAR
15 features v3.1.xlsx data 0,1 day0 ticker day0 ticker 129 37 CAR
16 features v3.1.xlsx data 0,3 day0 ticker day0 ticker 129 37 CAR
17 features v3.1.xlsx data 0,5 day0 ticker day0 ticker 129 37 CAR
Results per features set and window (grouped by ticker):
rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared features_file features_sheet event_sheet window
0 129 14 0.289191 0.201899 -0.047634 features v1.1.xlsx features CAR_(0,1) 0,1
1 129 16 0.303485 0.203983 -0.115372 features v1.xlsx features CAR_(0,1) 0,1
2 129 20 0.318352 0.192121 -0.106293 features v2.1.xlsx data CAR_(0,1) 0,1
3 129 24 0.335126 0.181693 -0.178825 features v2.xlsx data CAR_(0,1) 0,1
4 129 37 0.424671 0.190745 -0.824364 features v3.1.xlsx data CAR_(0,1) 0,1
5 129 41 0.452112 0.193912 -1.321893 features v3.xlsx data CAR_(0,1) 0,1
6 129 14 0.238836 0.145359 -0.062822 features v1.1.xlsx features CAR_(0,3) 0,3
7 129 16 0.250824 0.143799 -0.155072 features v1.xlsx features CAR_(0,3) 0,3
8 129 20 0.258931 0.121696 -0.157287 features v2.1.xlsx data CAR_(0,3) 0,3
9 129 24 0.272953 0.105174 -0.241672 features v2.xlsx data CAR_(0,3) 0,3
10 129 37 0.379130 0.126689 -1.047901 features v3.1.xlsx data CAR_(0,3) 0,3
11 129 41 0.403288 0.122078 -1.708581 features v3.xlsx data CAR_(0,3) 0,3
12 129 14 0.248291 0.155976 -0.008339 features v1.1.xlsx features CAR_(0,5) 0,5
13 129 16 0.257400 0.151314 -0.089552 features v1.xlsx features CAR_(0,5) 0,5
14 129 20 0.265767 0.129798 -0.117638 features v2.1.xlsx data CAR_(0,5) 0,5
15 129 24 0.273454 0.105789 -0.193283 features v2.xlsx data CAR_(0,5) 0,5
16 129 37 0.393061 0.146284 -0.894411 features v3.1.xlsx data CAR_(0,5) 0,5
17 129 41 0.414750 0.138942 -1.425380 features v3.xlsx data CAR_(0,5) 0,5
Before vs after (original vs pruned) — deltas > 0 are good:
window set baseline_cross_validated_r_squared pruned_cross_validated_r_squared delta_cross_validated_r_squared baseline_adjusted_r_squared pruned_adjusted_r_squared delta_adjusted_r_squared baseline_r_squared pruned_r_squared delta_r_squared rows_used_baseline rows_used_pruned features_used_baseline features_used_pruned
0 0,1 v1 -0.115372 -0.047634 0.067738 0.203983 0.201899 -0.002085 0.303485 0.289191 -0.014294 129 129 16 14
1 0,1 v2 -0.178825 -0.106293 0.072531 0.181693 0.192121 0.010427 0.335126 0.318352 -0.016774 129 129 24 20
2 0,1 v3 -1.321893 -0.824364 0.497529 0.193912 0.190745 -0.003166 0.452112 0.424671 -0.027441 129 129 41 37
3 0,3 v1 -0.155072 -0.062822 0.092250 0.143799 0.145359 0.001561 0.250824 0.238836 -0.011988 129 129 16 14
4 0,3 v2 -0.241672 -0.157287 0.084385 0.105174 0.121696 0.016522 0.272953 0.258931 -0.014023 129 129 24 20
5 0,3 v3 -1.708581 -1.047901 0.660679 0.122078 0.126689 0.004610 0.403288 0.379130 -0.024157 129 129 41 37
6 0,5 v1 -0.089552 -0.008339 0.081212 0.151314 0.155976 0.004662 0.257400 0.248291 -0.009109 129 129 16 14
7 0,5 v2 -0.193283 -0.117638 0.075645 0.105789 0.129798 0.024009 0.273454 0.265767 -0.007686 129 129 24 20
8 0,5 v3 -1.425380 -0.894411 0.530969 0.138942 0.146284 0.007342 0.414750 0.393061 -0.021689 129 129 41 37
Saved CSVs to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - test_results_all_sets.csv
 - test_results_pruned_vs_original.csv
In [3]:
# === Compare features v1 vs v1.1 vs v1.2 (join on day0 + ticker) ===
# Metrics: R^2, Adjusted R^2, Grouped (by ticker) Cross-Validated R^2
# If needed first:  pip install pandas numpy scikit-learn openpyxl

from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

# ---------- CONFIG ----------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
    Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["features v1.xlsx", "features v1.1.xlsx", "features v1.2.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5

# ---------- HELPERS ----------
def find_file(name: str):
    for b in BASE_DIRS:
        p = b / name
        if p.exists():
            return p
    return None

def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands:
        return next(iter(book))
    def score(item):
        n, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_event_window_sheets(book: dict):
    m = {"0,1": None, "0,3": None, "0,5": None}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
    }
    for name in book.keys():
        if is_readme_sheet(name): 
            continue
        for w, pat in pats.items():
            if m[w] is None and pat.search(str(name)):
                m[w] = name
    return m

def find_day0_column(df: pd.DataFrame):
    cols = [str(c) for c in df.columns]
    strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    best, best_nonnull = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > best_nonnull:
            best, best_nonnull = c, k
    return best

def find_ticker_column(df: pd.DataFrame):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    # fallback guess
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score:
            best, score = c, sc
    return best

def normalize_day0(s: pd.Series) -> pd.Series:
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series) -> pd.Series:
    return s.astype(str).str.strip().str.upper()

def find_target_col(df: pd.DataFrame):
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
    return c2[0] if c2 else None

def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
    df = df_feat_raw.copy()
    df["__day0__"]   = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
    g = g.dropna(subset=["__day0__","__ticker__"])
    return g, num_cols

def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    # drop zero-variance predictors
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
    n_groups = int(groups.nunique())
    n_splits = max(3, min(max_folds, n_groups))
    gkf = GroupKFold(n_splits=n_splits)
    model = LinearRegression()
    scores = []
    for tr, te in gkf.split(X, y, groups=groups):
        model.fit(X.iloc[tr].values, y.iloc[tr].values)
        y_pred = model.predict(X.iloc[te].values)
        y_true = y.iloc[te].values
        ss_res = np.sum((y_true - y_pred)**2)
        ss_tot = np.sum((y_true - np.mean(y_true))**2)
        r2_test = 1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan
        scores.append(r2_test)
    return float(np.nanmean(scores))

def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
    data = pd.concat([y, X], axis=1).dropna()
    y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
    n, p = len(y_c), X_c.shape[1]
    if p == 0 or n < max(10, p+2):
        return dict(rows_used=int(n), features_used=int(p),
                    r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
    mdl = LinearRegression().fit(X_c.values, y_c.values)
    r2 = float(mdl.score(X_c.values, y_c.values))
    adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
    cv = grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
    return dict(rows_used=int(n), features_used=int(p),
                r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)

# ---------- LOAD EVENT ----------
evt_path = find_file(EVENT_FILE)
if evt_path is None:
    raise FileNotFoundError("Could not find event_study.xlsx in the configured folders.")
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)

# ---------- RUN ----------
present = [f for f in FEATURE_FILES if find_file(f) is not None]
assert present, "None of the v1 files were found."

print("Testing files:", present)

merge_audit = []
results = []

for fname in present:
    fpath = find_file(fname)
    feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
    fsheet = choose_features_sheet(feat_book)
    df_feat_raw = feat_book[fsheet].copy()

    dfeat = find_day0_column(df_feat_raw)
    tfeat = find_ticker_column(df_feat_raw)
    feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)

    for w in WINDOWS:
        esheet = win_map.get(w)
        if esheet is None:
            print(f"Missing event sheet for window {w}. Skipping.")
            continue

        df_evt = evt_book[esheet].copy()
        devt = find_day0_column(df_evt)
        tevt = find_ticker_column(df_evt)
        ycol = find_target_col(df_evt)

        evt = df_evt.copy()
        evt["__day0__"]   = normalize_day0(evt[devt])
        evt["__ticker__"] = normalize_ticker(evt[tevt])
        evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])

        merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
        groups = merged["__ticker__"]
        X = build_X(merged, num_cols, ycol)
        y = merged[ycol].astype(float)

        merge_audit.append({
            "features_file": fname, "features_sheet": fsheet, "event_sheet": esheet, "window": w,
            "day0_features_col": dfeat, "ticker_features_col": tfeat,
            "day0_event_col": devt, "ticker_event_col": tevt,
            "merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
        })

        m = fit_and_score(X, y, groups)
        m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
        results.append(m)

# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)

print("\nMerge audit:")
display(pd.DataFrame(merge_audit))

res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1 vs v1.1 vs v1.2):")
display(res_df)

print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
                          columns="features_file",
                          values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
                          aggfunc="first")
display(wide)

# Save CSVs next to your data
out_dir = evt_path.parent
res_df.to_csv(out_dir / "v1_v1.1_v1.2_results.csv", index=False)
wide.to_csv(out_dir / "v1_v1.1_v1.2_comparison_table.csv")
print(f"\nSaved to: {out_dir}")
print(" - v1_v1.1_v1.2_results.csv")
print(" - v1_v1.1_v1.2_comparison_table.csv")
Testing files: ['features v1.xlsx', 'features v1.1.xlsx', 'features v1.2.xlsx']

Merge audit:
features_file features_sheet event_sheet window day0_features_col ticker_features_col day0_event_col ticker_event_col merged_rows predictor_cols target_col
0 features v1.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 16 CAR
1 features v1.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 16 CAR
2 features v1.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 16 CAR
3 features v1.1.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 14 CAR
4 features v1.1.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 14 CAR
5 features v1.1.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 14 CAR
6 features v1.2.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 8 CAR
7 features v1.2.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 8 CAR
8 features v1.2.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 8 CAR
Results (v1 vs v1.1 vs v1.2):
rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared features_file features_sheet window
0 129 14 0.289191 0.201899 -0.047634 features v1.1.xlsx features 0,1
1 129 8 0.245160 0.194838 0.068034 features v1.2.xlsx features 0,1
2 129 16 0.303485 0.203983 -0.115372 features v1.xlsx features 0,1
3 129 14 0.238836 0.145359 -0.062822 features v1.1.xlsx features 0,3
4 129 8 0.201481 0.148246 0.094267 features v1.2.xlsx features 0,3
5 129 16 0.250824 0.143799 -0.155072 features v1.xlsx features 0,3
6 129 14 0.248291 0.155976 -0.008339 features v1.1.xlsx features 0,5
7 129 8 0.214735 0.162384 0.121771 features v1.2.xlsx features 0,5
8 129 16 0.257400 0.151314 -0.089552 features v1.xlsx features 0,5
Comparison table (rows = windows | columns = metrics per file):
adjusted_r_squared cross_validated_r_squared r_squared
features_file features v1.1.xlsx features v1.2.xlsx features v1.xlsx features v1.1.xlsx features v1.2.xlsx features v1.xlsx features v1.1.xlsx features v1.2.xlsx features v1.xlsx
window
0,1 0.201899 0.194838 0.203983 -0.047634 0.068034 -0.115372 0.289191 0.245160 0.303485
0,3 0.145359 0.148246 0.143799 -0.062822 0.094267 -0.155072 0.238836 0.201481 0.250824
0,5 0.155976 0.162384 0.151314 -0.008339 0.121771 -0.089552 0.248291 0.214735 0.257400
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v1_v1.1_v1.2_results.csv
 - v1_v1.1_v1.2_comparison_table.csv
In [11]:
# === Compare features v1.2 vs v1.3 (join on day0 + ticker) ===
# Metrics: R^2, Adjusted R^2, Grouped Cross-Validated R^2 (by ticker)
# If needed first:  pip install pandas numpy scikit-learn openpyxl

from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

# ---------- CONFIG ----------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
    Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["features v1.2.xlsx", "features v1.3.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5

# ---------- HELPERS ----------
def find_file(name: str):
    for b in BASE_DIRS:
        p = b / name
        if p.exists():
            return p
    return None

def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands:
        return next(iter(book))
    def score(item):
        n, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_event_window_sheets(book: dict):
    m = {"0,1": None, "0,3": None, "0,5": None}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
    }
    for name in book.keys():
        if is_readme_sheet(name): 
            continue
        for w, pat in pats.items():
            if m[w] is None and pat.search(str(name)):
                m[w] = name
    return m

def find_day0_column(df: pd.DataFrame):
    cols = [str(c) for c in df.columns]
    strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    # most date-like
    best, kbest = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > kbest: best, kbest = c, k
    return best

def find_ticker_column(df: pd.DataFrame):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    # fallback guess
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score:
            best, score = c, sc
    return best

def normalize_day0(s: pd.Series) -> pd.Series:
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series) -> pd.Series:
    return s.astype(str).str.strip().str.upper()

def find_target_col(df: pd.DataFrame):
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
    return c2[0] if c2 else None

def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
    df = df_feat_raw.copy()
    df["__day0__"]   = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
    g = g.dropna(subset=["__day0__","__ticker__"])
    return g, num_cols

def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    # drop zero-variance predictors
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
    n_groups = int(groups.nunique())
    n_splits = max(3, min(max_folds, n_groups))
    gkf = GroupKFold(n_splits=n_splits)
    model = LinearRegression()
    scores = []
    for tr, te in gkf.split(X, y, groups=groups):
        model.fit(X.iloc[tr].values, y.iloc[tr].values)
        y_pred = model.predict(X.iloc[te].values)
        y_true = y.iloc[te].values
        ss_res = np.sum((y_true - y_pred)**2)
        ss_tot = np.sum((y_true - np.mean(y_true))**2)
        r2_test = 1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan
        scores.append(r2_test)
    return float(np.nanmean(scores))

def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
    data = pd.concat([y, X], axis=1).dropna()
    y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
    n, p = len(y_c), X_c.shape[1]
    if p == 0 or n < max(10, p+2):
        return dict(rows_used=int(n), features_used=int(p),
                    r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
    mdl = LinearRegression().fit(X_c.values, y_c.values)
    r2 = float(mdl.score(X_c.values, y_c.values))
    adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
    cv = grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
    return dict(rows_used=int(n), features_used=int(p),
                r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)

# ---------- LOAD EVENT ----------
evt_path = find_file(EVENT_FILE)
if evt_path is None:
    raise FileNotFoundError("Could not find event_study.xlsx in the configured folders.")
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)

# ---------- RUN ----------
present = [f for f in FEATURE_FILES if find_file(f) is not None]
assert present, "features v1.2.xlsx and features v1.3.xlsx were not found."

print("Testing files:", present)

merge_audit = []
results = []

for fname in present:
    fpath = find_file(fname)
    feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
    fsheet = choose_features_sheet(feat_book)
    df_feat_raw = feat_book[fsheet].copy()

    dfeat = find_day0_column(df_feat_raw)
    tfeat = find_ticker_column(df_feat_raw)
    feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)

    for w in WINDOWS:
        esheet = win_map.get(w)
        if esheet is None:
            print(f"Missing event sheet for window {w}. Skipping.")
            continue

        df_evt = evt_book[esheet].copy()
        devt = find_day0_column(df_evt)
        tevt = find_ticker_column(df_evt)
        ycol = find_target_col(df_evt)

        evt = df_evt.copy()
        evt["__day0__"]   = normalize_day0(evt[devt])
        evt["__ticker__"] = normalize_ticker(evt[tevt])
        evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])

        merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
        groups = merged["__ticker__"]
        X = build_X(merged, num_cols, ycol)
        y = merged[ycol].astype(float)

        merge_audit.append({
            "features_file": fname, "features_sheet": fsheet, "event_sheet": esheet, "window": w,
            "day0_features_col": dfeat, "ticker_features_col": tfeat,
            "day0_event_col": devt, "ticker_event_col": tevt,
            "merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
        })

        m = fit_and_score(X, y, groups)
        m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
        results.append(m)

# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)

print("\nMerge audit:")
display(pd.DataFrame(merge_audit))

res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1.2 vs v1.3):")
display(res_df)

print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
                          columns="features_file",
                          values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
                          aggfunc="first")
display(wide)

# ---------- DELTAS (v1.3 - v1.2) ----------
if set(FEATURE_FILES).issubset(set(res_df["features_file"].unique())):
    pairs = []
    for w in WINDOWS:
        a = res_df[(res_df["features_file"]=="features v1.2.xlsx") & (res_df["window"]==w)]
        b = res_df[(res_df["features_file"]=="features v1.3.xlsx") & (res_df["window"]==w)]
        if not a.empty and not b.empty:
            pairs.append({
                "window": w,
                "delta_cross_validated_r_squared": float(b["cross_validated_r_squared"].iloc[0] - a["cross_validated_r_squared"].iloc[0]),
                "delta_adjusted_r_squared": float(b["adjusted_r_squared"].iloc[0] - a["adjusted_r_squared"].iloc[0]),
                "delta_r_squared": float(b["r_squared"].iloc[0] - a["r_squared"].iloc[0]),
                "rows_used_v1.2": int(a["rows_used"].iloc[0]),
                "rows_used_v1.3": int(b["rows_used"].iloc[0]),
                "features_used_v1.2": int(a["features_used"].iloc[0]),
                "features_used_v1.3": int(b["features_used"].iloc[0]),
            })
    if pairs:
        deltas = pd.DataFrame(pairs)
        print("\nDeltas (v1.3 minus v1.2) — positive is good:")
        display(deltas)

# Save CSVs next to your data
out_dir = find_file(EVENT_FILE).parent
res_df.to_csv(out_dir / "v1.2_vs_v1.3_results.csv", index=False)
wide.to_csv(out_dir / "v1.2_vs_v1.3_comparison_table.csv")
print(f"\nSaved to: {out_dir}")
print(" - v1.2_vs_v1.3_results.csv")
print(" - v1.2_vs_v1.3_comparison_table.csv")
Testing files: ['features v1.2.xlsx', 'features v1.3.xlsx']

Merge audit:
features_file features_sheet event_sheet window day0_features_col ticker_features_col day0_event_col ticker_event_col merged_rows predictor_cols target_col
0 features v1.2.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 8 CAR
1 features v1.2.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 8 CAR
2 features v1.2.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 8 CAR
3 features v1.3.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 16 CAR
4 features v1.3.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 16 CAR
5 features v1.3.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 16 CAR
Results (v1.2 vs v1.3):
rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared features_file features_sheet window
0 129 8 0.245160 0.194838 0.068034 features v1.2.xlsx features 0,1
1 129 16 0.259498 0.153712 -0.337959 features v1.3.xlsx features 0,1
2 129 8 0.201481 0.148246 0.094267 features v1.2.xlsx features 0,3
3 129 16 0.216124 0.104142 -0.111012 features v1.3.xlsx features 0,3
4 129 8 0.214735 0.162384 0.121771 features v1.2.xlsx features 0,5
5 129 16 0.226652 0.116174 -0.037386 features v1.3.xlsx features 0,5
Comparison table (rows = windows | columns = metrics per file):
adjusted_r_squared cross_validated_r_squared r_squared
features_file features v1.2.xlsx features v1.3.xlsx features v1.2.xlsx features v1.3.xlsx features v1.2.xlsx features v1.3.xlsx
window
0,1 0.194838 0.153712 0.068034 -0.337959 0.245160 0.259498
0,3 0.148246 0.104142 0.094267 -0.111012 0.201481 0.216124
0,5 0.162384 0.116174 0.121771 -0.037386 0.214735 0.226652
Deltas (v1.3 minus v1.2) — positive is good:
window delta_cross_validated_r_squared delta_adjusted_r_squared delta_r_squared rows_used_v1.2 rows_used_v1.3 features_used_v1.2 features_used_v1.3
0 0,1 -0.405993 -0.041126 0.014337 129 129 8 16
1 0,3 -0.205279 -0.044105 0.014643 129 129 8 16
2 0,5 -0.159157 -0.046209 0.011918 129 129 8 16
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v1.2_vs_v1.3_results.csv
 - v1.2_vs_v1.3_comparison_table.csv
In [19]:
# === FIXED: Grow v1.2 using v3 candidates with safe group-aware CV ===
# - Adapts folds to the number of tickers in each split
# - Falls back to ordinary KFold when groups are too few
# - Same outputs as before (baseline, marginal gains, selected features, summary)

from pathlib import Path
import re
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold

# ---------- CONFIG ----------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
    Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
BASE_FILE = "features v1.2.xlsx"
POOL_FILE = "features v3.xlsx"
WINDOWS = ["0,1","0,3","0,5"]

MAX_OUTER_FOLDS = 5
MAX_INNER_FOLDS = 3
MAX_FEATURES_TO_ADD = 5
MIN_GAIN = 0.01

# ---------- HELPERS ----------
def find_file(name: str):
    for b in BASE_DIRS:
        p = b / name
        if p.exists(): return p
    raise FileNotFoundError(f"Could not find {name}")

def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands: return next(iter(book))
    def score(item):
        n, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_event_window_sheets(book: dict):
    m = {"0,1": None, "0,3": None, "0,5": None}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
    }
    for name in book:
        if is_readme_sheet(name): continue
        for w, pat in pats.items():
            if m[w] is None and pat.search(str(name)): m[w] = name
    return m

def find_day0_column(df: pd.DataFrame):
    cols = [str(c) for c in df.columns]
    strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    best, kbest = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > kbest: best, kbest = c, k
    return best

def find_ticker_column(df: pd.DataFrame):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score: best, score = c, sc
    return best

def normalize_day0(s: pd.Series) -> pd.Series:
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series) -> pd.Series:
    return s.astype(str).str.strip().str.upper()

def find_target_col(df: pd.DataFrame):
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
    return c2[0] if c2 else None

def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
    df = df_feat_raw.copy()
    df["__day0__"]   = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
    g = g.dropna(subset=["__day0__","__ticker__"])
    return g, num_cols

def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def test_r2_on_fold(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_hat = model.predict(X_test)
    ss_res = np.sum((y_test - y_hat)**2)
    ss_tot = np.sum((y_test - np.mean(y_test))**2)
    return 1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan

def safe_group_cv_scores(X, y, groups, max_splits=5, min_splits=2):
    """Return mean test R^2 and the splitter used.
       Uses GroupKFold if groups are enough, else falls back to ordinary KFold."""
    n_groups = int(pd.Series(groups).nunique())
    if n_groups >= min_splits:
        n_splits = min(max_splits, n_groups)
        splitter = GroupKFold(n_splits=n_splits)
        model = LinearRegression()
        scores = []
        for tr, te in splitter.split(X, y, groups=groups):
            scores.append(test_r2_on_fold(model, X.iloc[tr].values, y.iloc[tr].values,
                                          X.iloc[te].values, y.iloc[te].values))
        return float(np.nanmean(scores)), splitter
    # fallback: ordinary KFold
    n = len(X)
    if n < 3:
        return np.nan, None
    n_splits = min(3, n)
    splitter = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    model = LinearRegression()
    scores = []
    for tr, te in splitter.split(X, y):
        scores.append(test_r2_on_fold(model, X.iloc[tr].values, y.iloc[tr].values,
                                      X.iloc[te].values, y.iloc[te].values))
    return float(np.nanmean(scores)), splitter

def in_sample_and_adjusted(X: pd.DataFrame, y: pd.Series):
    if X.shape[1] == 0:
        return np.nan, np.nan
    mdl = LinearRegression().fit(X.values, y.values)
    r2 = float(mdl.score(X.values, y.values))
    n, p = len(y), X.shape[1]
    adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
    return r2, adj

# ---------- LOAD ----------
evt_path = find_file(EVENT_FILE)
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)

base_path = find_file(BASE_FILE)
pool_path = find_file(POOL_FILE)

base_book = pd.read_excel(base_path, sheet_name=None, engine="openpyxl")
base_sheet = choose_features_sheet(base_book)
base_raw = base_book[base_sheet].copy()
base_day0 = find_day0_column(base_raw)
base_ticker = find_ticker_column(base_raw)
base_grouped, base_num_cols = aggregate_features(base_raw, base_day0, base_ticker)

pool_book = pd.read_excel(pool_path, sheet_name=None, engine="openpyxl")
pool_sheet = choose_features_sheet(pool_book)
pool_raw = pool_book[pool_sheet].copy()
pool_day0 = find_day0_column(pool_raw)
pool_ticker = find_ticker_column(pool_raw)
pool_grouped, pool_num_cols = aggregate_features(pool_raw, pool_day0, pool_ticker)

candidate_cols = [c for c in pool_num_cols if c not in base_num_cols]

# ---------- WORK ----------
all_quick = []
all_selected = []
all_summary = []

for window in WINDOWS:
    esheet = win_map.get(window)
    if esheet is None:
        print(f"Skip window {window}: event sheet not found.")
        continue

    df_evt = evt_book[esheet].copy()
    event_day0 = find_day0_column(df_evt)
    event_ticker = find_ticker_column(df_evt)
    y_col = find_target_col(df_evt)

    evt = df_evt.copy()
    evt["__day0__"]   = normalize_day0(evt[event_day0])
    evt["__ticker__"] = normalize_ticker(evt[event_ticker])
    evt = evt.dropna(subset=["__day0__","__ticker__", y_col]).drop_duplicates(subset=["__day0__","__ticker__"])

    merged_base = base_grouped.merge(evt[["__day0__","__ticker__", y_col]], on=["__day0__","__ticker__"], how="inner")
    merged_pool = pool_grouped[["__day0__","__ticker__"] + candidate_cols]
    merged = merged_base.merge(merged_pool, on=["__day0__","__ticker__"], how="left")

    X_base = build_X(merged, base_num_cols, y_col)
    y = merged[y_col].astype(float)
    groups = merged["__ticker__"]

    # Baseline
    base_cv, outer_splitter = safe_group_cv_scores(X_base, y, groups, max_splits=MAX_OUTER_FOLDS, min_splits=2)
    base_r2, base_adj = in_sample_and_adjusted(X_base, y)

    # Quick marginal gains (add one feature at a time)
    quick_rows = []
    for c in candidate_cols:
        if c not in merged.columns: 
            continue
        Xt = pd.concat([X_base, merged[[c]]], axis=1)
        data = pd.concat([y, Xt], axis=1).dropna()
        y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
        if X_c.shape[1] == 0 or len(y_c) < 10: 
            continue
        cv_r2, _ = safe_group_cv_scores(X_c, y_c, groups.loc[X_c.index], max_splits=MAX_OUTER_FOLDS, min_splits=2)
        quick_rows.append({"window": window, "feature": c, "cv_with_feature": cv_r2, "delta": cv_r2 - base_cv})
    quick_df = pd.DataFrame(quick_rows).sort_values("delta", ascending=False).reset_index(drop=True)
    all_quick.append(quick_df)

    # Nested grouped forward selection (safe inner splitting)
    # Build outer splitter (group-aware if possible)
    outer_scores = []
    fold_selected = []

    # If we could not build a group-aware splitter (very rare), fall back to KFold
    if isinstance(outer_splitter, GroupKFold):
        splits = list(outer_splitter.split(X_base, y, groups=groups))
    elif isinstance(outer_splitter, KFold):
        splits = list(outer_splitter.split(X_base, y))
    else:
        # no splitter possible
        splits = [(np.arange(len(X_base)), np.arange(0))]

    for tr, te in splits:
        Xb_tr, Xb_te = X_base.iloc[tr], X_base.iloc[te]
        y_tr, y_te = y.iloc[tr], y.iloc[te]
        groups_tr = groups.iloc[tr]

        # inner helper with safe group CV on training fold
        def inner_cv_score(Xt, yt):
            return safe_group_cv_scores(Xt, yt, groups_tr.loc[Xt.index], max_splits=MAX_INNER_FOLDS, min_splits=2)[0]

        # start point
        data_tr = pd.concat([y_tr, Xb_tr], axis=1).dropna()
        y_tr_c, X_tr_c = data_tr.iloc[:,0], data_tr.iloc[:,1:]
        base_inner = inner_cv_score(X_tr_c, y_tr_c)

        avail = [c for c in candidate_cols if c in merged.columns]
        chosen = []

        for _ in range(MAX_FEATURES_TO_ADD):
            best_gain, best_feat = 0.0, None
            for c in avail:
                col = merged.loc[Xb_tr.index, c]
                Xt = pd.concat([X_tr_c, col], axis=1).dropna()
                yt = y_tr.loc[Xt.index]
                if Xt.shape[1] == 0 or len(yt) < 10:
                    continue
                score = inner_cv_score(Xt, yt)
                gain = score - base_inner
                if gain > best_gain:
                    best_gain, best_feat = gain, c
            if best_feat is None or best_gain < MIN_GAIN:
                break
            # accept and update
            chosen.append(best_feat)
            avail.remove(best_feat)
            X_tr_c = pd.concat([X_tr_c, merged.loc[Xb_tr.index, [best_feat]]], axis=1).dropna()
            y_tr_c = y_tr.loc[X_tr_c.index]
            base_inner = inner_cv_score(X_tr_c, y_tr_c)

        fold_selected.append(chosen)

        # evaluate on outer test fold
        X_te = Xb_te.copy()
        if chosen:
            X_te = pd.concat([X_te, merged.loc[Xb_te.index, chosen]], axis=1)
        data_te = pd.concat([y_te, X_te], axis=1).dropna()
        y_te_c, X_te_c = data_te.iloc[:,0], data_te.iloc[:,1:]
        if X_te_c.shape[1] == 0 or len(y_te_c) < 2:
            outer_scores.append(np.nan)
        else:
            outer_scores.append(test_r2_on_fold(LinearRegression(),
                                                X_tr_c.values, y_tr_c.values,
                                                X_te_c.values, y_te_c.values))

    # Frequencies across outer folds
    flat = [f for sub in fold_selected for f in sub]
    freq = pd.Series(flat).value_counts().rename("selected_in_folds").to_frame()
    freq["window"] = window
    freq = freq.reset_index().rename(columns={"index":"feature"})
    all_selected.append(freq)

    # Union of features picked in at least half the folds
    keep_union = []
    if not freq.empty and len(splits) > 0:
        half = max(1, int(np.ceil(len(splits)/2)))
        keep_union = freq.loc[freq["selected_in_folds"] >= half, "feature"].tolist()

    X_full = X_base.copy()
    if keep_union:
        X_full = pd.concat([X_full, merged[keep_union]], axis=1)
    data_full = pd.concat([y, X_full], axis=1).dropna()
    y_full, X_full_c = data_full.iloc[:,0], data_full.iloc[:,1:]
    full_cv, _ = safe_group_cv_scores(X_full_c, y_full, groups.loc[X_full_c.index], max_splits=MAX_OUTER_FOLDS, min_splits=2)
    full_r2, full_adj = in_sample_and_adjusted(X_full_c, y_full)

    all_summary.append({
        "window": window,
        "baseline_cross_validated_r_squared": base_cv,
        "baseline_r_squared": base_r2,
        "baseline_adjusted_r_squared": base_adj,
        "nested_forward_mean_test_cross_validated_r_squared": float(np.nanmean(outer_scores)) if outer_scores else np.nan,
        "selected_union_features": ", ".join(keep_union) if keep_union else "",
        "union_model_cross_validated_r_squared": full_cv,
        "union_model_r_squared": full_r2,
        "union_model_adjusted_r_squared": full_adj,
        "n_selected_union": len(keep_union)
    })

# ---------- REPORT ----------
quick_all = pd.concat(all_quick, ignore_index=True) if all_quick else pd.DataFrame()
selected_all = pd.concat(all_selected, ignore_index=True) if all_selected else pd.DataFrame()
summary = pd.DataFrame(all_summary)

pd.set_option("display.max_rows", 200)
pd.set_option("display.max_columns", None)

print("\n=== Baseline vs improved (per window) ===")
display(summary)

if not quick_all.empty:
    print("\n=== Quick marginal gains (top 25 per window) — delta vs v1.2 baseline cross validated coefficient of determination ===")
    display(quick_all.sort_values(["window","delta"], ascending=[True, False]).groupby("window").head(25))

if not selected_all.empty:
    print("\n=== Features selected by nested forward selection (frequency across outer folds) ===")
    display(selected_all.sort_values(["window","selected_in_folds"], ascending=[True, False]))

# ---------- SAVE ----------
out_dir = find_file(EVENT_FILE).parent
summary.to_csv(out_dir / "v12_growth_summary.csv", index=False)
if not quick_all.empty:
    quick_all.to_csv(out_dir / "v12_growth_quick_marginal_gains.csv", index=False)
if not selected_all.empty:
    selected_all.to_csv(out_dir / "v12_growth_selected_frequencies.csv", index=False)

print(f"\nSaved to: {out_dir}")
print(" - v12_growth_summary.csv")
print(" - v12_growth_quick_marginal_gains.csv")
print(" - v12_growth_selected_frequencies.csv")
=== Baseline vs improved (per window) ===
window baseline_cross_validated_r_squared baseline_r_squared baseline_adjusted_r_squared nested_forward_mean_test_cross_validated_r_squared selected_union_features union_model_cross_validated_r_squared union_model_r_squared union_model_adjusted_r_squared n_selected_union
0 0,1 0.068034 0.245160 0.194838 -0.121056 baa_minus_aaa_bp, pre_vol_10d 0.040045 0.272545 0.210896 2
1 0,3 0.094267 0.201481 0.148246 -0.121196 pre_vol_10d 0.093876 0.212336 0.152765 1
2 0,5 0.121771 0.214735 0.162384 -0.034418 pre_vol_10d 0.120651 0.221060 0.162149 1
=== Quick marginal gains (top 25 per window) — delta vs v1.2 baseline cross validated coefficient of determination ===
window feature cv_with_feature delta
0 0,1 macro_cpi_yoy 0.081215 1.318111e-02
1 0,1 pre_vol_10d 0.072635 4.601765e-03
2 0,1 rates_x_surprise 0.069896 1.862384e-03
3 0,1 is_amc 0.068034 -8.881784e-16
4 0,1 is_bmo 0.068034 -8.881784e-16
5 0,1 is_friday 0.068034 -8.881784e-16
6 0,1 pre_ret_10d 0.065582 -2.451491e-03
7 0,1 cpi_x_prevol5d 0.063162 -4.871555e-03
8 0,1 is_monday 0.060641 -7.392425e-03
9 0,1 vix_chg_10d_lag1 0.060394 -7.639289e-03
10 0,1 high_vix_regime 0.057423 -1.061060e-02
11 0,1 cpi_x_surprise 0.054169 -1.386500e-02
12 0,1 is_january 0.051872 -1.616123e-02
13 0,1 investment_grade_option_adjusted_spread_bp 0.050660 -1.737398e-02
14 0,1 investment_grade_option_adjusted_spread_pct 0.050660 -1.737398e-02
15 0,1 macro_fedfunds 0.050613 -1.742084e-02
16 0,1 high_rates_regime 0.049974 -1.805981e-02
17 0,1 pre_vol_3d 0.048157 -1.987686e-02
18 0,1 weekly_density 0.048107 -1.992688e-02
19 0,1 high_density_week 0.048107 -1.992688e-02
20 0,1 month 0.046575 -2.145911e-02
21 0,1 baa_minus_aaa_bp 0.044385 -2.364825e-02
22 0,1 baa_minus_aaa_pct 0.044385 -2.364825e-02
23 0,1 mkt_ret_1d_lag1 0.041337 -2.669715e-02
24 0,1 quarter 0.040292 -2.774160e-02
36 0,3 macro_cpi_yoy 0.096439 2.172138e-03
37 0,3 is_amc 0.094267 6.383782e-16
38 0,3 is_friday 0.094267 6.383782e-16
39 0,3 is_bmo 0.094267 6.383782e-16
40 0,3 pre_vol_10d 0.093876 -3.911947e-04
41 0,3 is_monday 0.093087 -1.179961e-03
42 0,3 macro_fedfunds 0.090032 -4.235049e-03
43 0,3 rates_x_surprise 0.089324 -4.943540e-03
44 0,3 cpi_x_prevol5d 0.088516 -5.751032e-03
45 0,3 investment_grade_option_adjusted_spread_bp 0.088365 -5.902608e-03
46 0,3 investment_grade_option_adjusted_spread_pct 0.088365 -5.902608e-03
47 0,3 cpi_x_surprise 0.080796 -1.347118e-02
48 0,3 pre_vol_3d 0.080411 -1.385620e-02
49 0,3 high_rates_regime 0.079143 -1.512391e-02
50 0,3 is_january 0.078816 -1.545143e-02
51 0,3 high_vix_regime 0.077928 -1.633957e-02
52 0,3 pre_ret_10d 0.077158 -1.710937e-02
53 0,3 baa_minus_aaa_pct 0.076349 -1.791780e-02
54 0,3 baa_minus_aaa_bp 0.076349 -1.791780e-02
55 0,3 weekly_density 0.074380 -1.988693e-02
56 0,3 high_density_week 0.074380 -1.988693e-02
57 0,3 month 0.068821 -2.544574e-02
58 0,3 mkt_ret_1d_lag1 0.067033 -2.723425e-02
59 0,3 quarter 0.059743 -3.452387e-02
60 0,3 day_of_week 0.054746 -3.952102e-02
72 0,5 is_amc 0.121771 1.221245e-15
73 0,5 is_bmo 0.121771 1.221245e-15
74 0,5 is_friday 0.121771 1.221245e-15
75 0,5 pre_vol_10d 0.120651 -1.120119e-03
76 0,5 macro_fedfunds 0.119332 -2.438700e-03
77 0,5 macro_cpi_yoy 0.117592 -4.178903e-03
78 0,5 high_density_week 0.116452 -5.318717e-03
79 0,5 weekly_density 0.116452 -5.318717e-03
80 0,5 cpi_x_prevol5d 0.115113 -6.657469e-03
81 0,5 is_monday 0.114571 -7.199965e-03
82 0,5 investment_grade_option_adjusted_spread_bp 0.113210 -8.560419e-03
83 0,5 investment_grade_option_adjusted_spread_pct 0.113210 -8.560419e-03
84 0,5 high_vix_regime 0.111899 -9.871706e-03
85 0,5 high_rates_regime 0.111492 -1.027892e-02
86 0,5 baa_minus_aaa_bp 0.110206 -1.156436e-02
87 0,5 baa_minus_aaa_pct 0.110206 -1.156436e-02
88 0,5 is_january 0.108703 -1.306799e-02
89 0,5 pre_ret_10d 0.106843 -1.492733e-02
90 0,5 pre_vol_3d 0.105928 -1.584277e-02
91 0,5 month 0.101969 -1.980198e-02
92 0,5 rates_x_surprise 0.097229 -2.454192e-02
93 0,5 mkt_ret_1d_lag1 0.097109 -2.466205e-02
94 0,5 cpi_x_surprise 0.095834 -2.593715e-02
95 0,5 quarter 0.092512 -2.925895e-02
96 0,5 vix_chg_10d_lag1 0.091427 -3.034321e-02
=== Features selected by nested forward selection (frequency across outer folds) ===
feature selected_in_folds window
0 baa_minus_aaa_bp 2 0,1
1 pre_vol_10d 2 0,1
2 is_january 1 0,1
3 is_q4 1 0,1
4 month 1 0,1
5 day_of_week 1 0,1
6 rates_x_surprise 1 0,1
7 moody_aaa_yield_pct 1 0,1
8 vix_x_surprise 1 0,1
9 macro_cpi_yoy 1 0,1
10 pre_vol_10d 2 0,3
11 is_q4 1 0,3
12 is_january 1 0,3
13 day_of_week 1 0,3
14 rates_x_surprise 1 0,3
15 vix_x_surprise 1 0,3
16 moody_aaa_yield_pct 1 0,3
17 high_vix_regime 1 0,3
18 is_monday 1 0,3
19 pre_vol_10d 2 0,5
20 month 1 0,5
21 rates_x_surprise 1 0,5
22 is_january 1 0,5
23 baa_minus_aaa_pct 1 0,5
24 day_of_week 1 0,5
25 vix_x_surprise 1 0,5
26 baa_minus_aaa_bp 1 0,5
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v12_growth_summary.csv
 - v12_growth_quick_marginal_gains.csv
 - v12_growth_selected_frequencies.csv
In [21]:
# === Compare features v1.2 vs v1.4 (join on day0 + ticker) ===
# Metrics: R^2, Adjusted R^2, Grouped Cross-Validated R^2 (ticker-aware)
# If needed first:  pip install pandas numpy scikit-learn openpyxl

from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold

# ---------- CONFIG ----------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
    Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["features v1.2.xlsx", "features v1.4.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5

# ---------- HELPERS ----------
def find_file(name: str):
    for b in BASE_DIRS:
        p = b / name
        if p.exists():
            return p
    return None

def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands:
        return next(iter(book))
    def score(item):
        n, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_event_window_sheets(book: dict):
    m = {"0,1": None, "0,3": None, "0,5": None}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
    }
    for name in book.keys():
        if is_readme_sheet(name): 
            continue
        for w, pat in pats.items():
            if m[w] is None and pat.search(str(name)):
                m[w] = name
    return m

def find_day0_column(df: pd.DataFrame):
    cols = [str(c) for c in df.columns]
    strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    best, best_nonnull = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > best_nonnull:
            best, best_nonnull = c, k
    return best

def find_ticker_column(df: pd.DataFrame):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score:
            best, score = c, sc
    return best

def normalize_day0(s: pd.Series) -> pd.Series:
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series) -> pd.Series:
    return s.astype(str).str.strip().str.upper()

def find_target_col(df: pd.DataFrame):
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
    return c2[0] if c2 else None

def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
    df = df_feat_raw.copy()
    df["__day0__"]   = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
    g = g.dropna(subset=["__day0__","__ticker__"])
    return g, num_cols

def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    # drop zero-variance predictors
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
    """Mean test R^2 using GroupKFold when possible; KFold fallback if too few groups."""
    n_groups = int(pd.Series(groups).nunique())
    if n_groups >= 2:
        n_splits = min(max_folds, n_groups)
        gkf = GroupKFold(n_splits=n_splits)
        scores = []
        mdl = LinearRegression()
        for tr, te in gkf.split(X, y, groups=groups):
            mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
            y_pred = mdl.predict(X.iloc[te].values)
            y_true = y.iloc[te].values
            ss_res = np.sum((y_true - y_pred)**2)
            ss_tot = np.sum((y_true - np.mean(y_true))**2)
            scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
        return float(np.nanmean(scores))
    # fallback: plain KFold
    n = len(X)
    if n < 3:
        return np.nan
    n_splits = min(3, n)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    mdl = LinearRegression()
    for tr, te in kf.split(X, y):
        mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
        y_pred = mdl.predict(X.iloc[te].values)
        y_true = y.iloc[te].values
        ss_res = np.sum((y_true - y_pred)**2)
        ss_tot = np.sum((y_true - np.mean(y_true))**2)
        scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
    return float(np.nanmean(scores))

def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
    data = pd.concat([y, X], axis=1).dropna()
    y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
    n, p = len(y_c), X_c.shape[1]
    if p == 0 or n < max(10, p+2):
        return dict(rows_used=int(n), features_used=int(p),
                    r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
    mdl = LinearRegression().fit(X_c.values, y_c.values)
    r2 = float(mdl.score(X_c.values, y_c.values))
    adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
    cv = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
    return dict(rows_used=int(n), features_used=int(p),
                r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)

# ---------- LOAD EVENT ----------
evt_path = find_file(EVENT_FILE)
if evt_path is None:
    raise FileNotFoundError("event_study.xlsx not found in the configured folders.")
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)

# ---------- RUN ----------
present = [f for f in FEATURE_FILES if find_file(f) is not None]
assert present, "Could not find features v1.2.xlsx or features v1.4.xlsx."

print("Testing files:", present)

merge_audit = []
results = []

for fname in present:
    fpath = find_file(fname)
    feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
    fsheet = choose_features_sheet(feat_book)
    df_feat_raw = feat_book[fsheet].copy()

    dfeat = find_day0_column(df_feat_raw)
    tfeat = find_ticker_column(df_feat_raw)
    feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)

    for w in WINDOWS:
        esheet = win_map.get(w)
        if esheet is None:
            print(f"Missing event sheet for window {w}. Skipping.")
            continue

        df_evt = evt_book[esheet].copy()
        devt = find_day0_column(df_evt)
        tevt = find_ticker_column(df_evt)
        ycol = find_target_col(df_evt)

        evt = df_evt.copy()
        evt["__day0__"]   = normalize_day0(evt[devt])
        evt["__ticker__"] = normalize_ticker(evt[tevt])
        evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])

        merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
        groups = merged["__ticker__"]
        X = build_X(merged, num_cols, ycol)
        y = merged[ycol].astype(float)

        merge_audit.append({
            "features_file": fname, "features_sheet": fsheet, "event_sheet": esheet, "window": w,
            "day0_features_col": dfeat, "ticker_features_col": tfeat,
            "day0_event_col": devt, "ticker_event_col": tevt,
            "merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
        })

        m = fit_and_score(X, y, groups)
        m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
        results.append(m)

# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)

print("\nMerge audit:")
display(pd.DataFrame(merge_audit))

res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1.2 vs v1.4):")
display(res_df)

print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
                          columns="features_file",
                          values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
                          aggfunc="first")
display(wide)

# ---------- DELTAS (v1.4 - v1.2) ----------
pairs = []
for w in WINDOWS:
    a = res_df[(res_df["features_file"]=="features v1.2.xlsx") & (res_df["window"]==w)]
    b = res_df[(res_df["features_file"]=="features v1.4.xlsx") & (res_df["window"]==w)]
    if not a.empty and not b.empty:
        pairs.append({
            "window": w,
            "delta_cross_validated_r_squared": float(b["cross_validated_r_squared"].iloc[0] - a["cross_validated_r_squared"].iloc[0]),
            "delta_adjusted_r_squared": float(b["adjusted_r_squared"].iloc[0] - a["adjusted_r_squared"].iloc[0]),
            "delta_r_squared": float(b["r_squared"].iloc[0] - a["r_squared"].iloc[0]),
            "rows_used_v1.2": int(a["rows_used"].iloc[0]),
            "rows_used_v1.4": int(b["rows_used"].iloc[0]),
            "features_used_v1.2": int(a["features_used"].iloc[0]),
            "features_used_v1.4": int(b["features_used"].iloc[0]),
        })
if pairs:
    deltas = pd.DataFrame(pairs)
    print("\nDeltas (v1.4 minus v1.2) — positive is good:")
    display(deltas)

# Save CSVs next to your data
out_dir = find_file(EVENT_FILE).parent
res_df.to_csv(out_dir / "v1.2_vs_v1.4_results.csv", index=False)
wide.to_csv(out_dir / "v1.2_vs_v1.4_comparison_table.csv")
print(f"\nSaved to: {out_dir}")
print(" - v1.2_vs_v1.4_results.csv")
print(" - v1.2_vs_v1.4_comparison_table.csv")
Testing files: ['features v1.2.xlsx', 'features v1.4.xlsx']

Merge audit:
features_file features_sheet event_sheet window day0_features_col ticker_features_col day0_event_col ticker_event_col merged_rows predictor_cols target_col
0 features v1.2.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 8 CAR
1 features v1.2.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 8 CAR
2 features v1.2.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 8 CAR
3 features v1.4.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 11 CAR
4 features v1.4.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 11 CAR
5 features v1.4.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 11 CAR
Results (v1.2 vs v1.4):
rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared features_file features_sheet window
0 129 8 0.245160 0.194838 0.068034 features v1.2.xlsx features 0,1
1 129 11 0.262463 0.193121 -0.143365 features v1.4.xlsx features 0,1
2 129 8 0.201481 0.148246 0.094267 features v1.2.xlsx features 0,3
3 129 11 0.215955 0.142241 -0.128261 features v1.4.xlsx features 0,3
4 129 8 0.214735 0.162384 0.121771 features v1.2.xlsx features 0,5
5 129 11 0.222797 0.149727 -0.108542 features v1.4.xlsx features 0,5
Comparison table (rows = windows | columns = metrics per file):
adjusted_r_squared cross_validated_r_squared r_squared
features_file features v1.2.xlsx features v1.4.xlsx features v1.2.xlsx features v1.4.xlsx features v1.2.xlsx features v1.4.xlsx
window
0,1 0.194838 0.193121 0.068034 -0.143365 0.245160 0.262463
0,3 0.148246 0.142241 0.094267 -0.128261 0.201481 0.215955
0,5 0.162384 0.149727 0.121771 -0.108542 0.214735 0.222797
Deltas (v1.4 minus v1.2) — positive is good:
window delta_cross_validated_r_squared delta_adjusted_r_squared delta_r_squared rows_used_v1.2 rows_used_v1.4 features_used_v1.2 features_used_v1.4
0 0,1 -0.211399 -0.001716 0.017302 129 129 8 11
1 0,3 -0.222528 -0.006006 0.014474 129 129 8 11
2 0,5 -0.230313 -0.012657 0.008063 129 129 8 11
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v1.2_vs_v1.4_results.csv
 - v1.2_vs_v1.4_comparison_table.csv
In [23]:
# === Compare features v1.2 vs features v1.5 (join on day0 + ticker) ===
# Metrics: coefficient of determination, adjusted coefficient of determination,
#          cross validated coefficient of determination (ticker-aware)
# If needed first:  pip install pandas numpy scikit-learn openpyxl

from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold

# ---------- CONFIG ----------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
    Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["features v1.2.xlsx", "features v1.5.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5

# ---------- HELPERS ----------
def find_file(name: str):
    for b in BASE_DIRS:
        p = b / name
        if p.exists():
            return p
    return None

def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands:
        return next(iter(book))
    def score(item):
        n, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_event_window_sheets(book: dict):
    m = {"0,1": None, "0,3": None, "0,5": None}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
    }
    for name in book.keys():
        if is_readme_sheet(name): 
            continue
        for w, pat in pats.items():
            if m[w] is None and pat.search(str(name)):
                m[w] = name
    return m

def find_day0_column(df: pd.DataFrame):
    cols = [str(c) for c in df.columns]
    strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    best, best_nonnull = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > best_nonnull:
            best, best_nonnull = c, k
    return best

def find_ticker_column(df: pd.DataFrame):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score:
            best, score = c, sc
    return best

def normalize_day0(s: pd.Series) -> pd.Series:
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series) -> pd.Series:
    return s.astype(str).str.strip().str.upper()

def find_target_col(df: pd.DataFrame):
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
    return c2[0] if c2 else None

def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
    df = df_feat_raw.copy()
    df["__day0__"]   = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
    g = g.dropna(subset=["__day0__","__ticker__"])
    return g, num_cols

def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
    """Mean test coefficient of determination using group folds when possible; row folds fallback if too few groups."""
    n_groups = int(pd.Series(groups).nunique())
    if n_groups >= 2:
        n_splits = min(max_folds, n_groups)
        gkf = GroupKFold(n_splits=n_splits)
        scores = []
        mdl = LinearRegression()
        for tr, te in gkf.split(X, y, groups=groups):
            mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
            y_pred = mdl.predict(X.iloc[te].values)
            y_true = y.iloc[te].values
            ss_res = np.sum((y_true - y_pred)**2)
            ss_tot = np.sum((y_true - np.mean(y_true))**2)
            scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
        return float(np.nanmean(scores))
    # fallback to ordinary KFold on rows
    n = len(X)
    if n < 3:
        return np.nan
    n_splits = min(3, n)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    mdl = LinearRegression()
    for tr, te in kf.split(X, y):
        mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
        y_pred = mdl.predict(X.iloc[te].values)
        y_true = y.iloc[te].values
        ss_res = np.sum((y_true - y_pred)**2)
        ss_tot = np.sum((y_true - np.mean(y_true))**2)
        scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
    return float(np.nanmean(scores))

def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
    data = pd.concat([y, X], axis=1).dropna()
    y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
    n, p = len(y_c), X_c.shape[1]
    if p == 0 or n < max(10, p+2):
        return dict(rows_used=int(n), features_used=int(p),
                    r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
    mdl = LinearRegression().fit(X_c.values, y_c.values)
    r2 = float(mdl.score(X_c.values, y_c.values))
    adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
    cv = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
    return dict(rows_used=int(n), features_used=int(p),
                r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)

# ---------- LOAD EVENT ----------
evt_path = find_file(EVENT_FILE)
if evt_path is None:
    raise FileNotFoundError("event_study.xlsx not found in the configured folders.")
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)

# ---------- RUN ----------
present = [f for f in FEATURE_FILES if find_file(f) is not None]
assert present, "Could not find features v1.2.xlsx or features v1.5.xlsx."

print("Testing files:", present)

merge_audit = []
results = []

for fname in present:
    fpath = find_file(fname)
    feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
    fsheet = choose_features_sheet(feat_book)
    df_feat_raw = feat_book[fsheet].copy()

    dfeat = find_day0_column(df_feat_raw)
    tfeat = find_ticker_column(df_feat_raw)
    feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)

    for w in WINDOWS:
        esheet = win_map.get(w)
        if esheet is None:
            print(f"Missing event sheet for window {w}. Skipping.")
            continue

        df_evt = evt_book[esheet].copy()
        devt = find_day0_column(df_evt)
        tevt = find_ticker_column(df_evt)
        ycol = find_target_col(df_evt)

        evt = df_evt.copy()
        evt["__day0__"]   = normalize_day0(evt[devt])
        evt["__ticker__"] = normalize_ticker(evt[tevt])
        evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])

        merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
        groups = merged["__ticker__"]
        X = build_X(merged, num_cols, ycol)
        y = merged[ycol].astype(float)

        merge_audit.append({
            "features_file": fname, "features_sheet": fsheet, "event_sheet": esheet, "window": w,
            "day0_features_col": dfeat, "ticker_features_col": tfeat,
            "day0_event_col": devt, "ticker_event_col": tevt,
            "merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
        })

        m = fit_and_score(X, y, groups)
        m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
        results.append(m)

# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)

print("\nMerge audit:")
display(pd.DataFrame(merge_audit))

res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1.2 vs v1.5):")
display(res_df)

print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
                          columns="features_file",
                          values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
                          aggfunc="first")
display(wide)

# ---------- DELTAS (v1.5 minus v1.2) ----------
pairs = []
for w in WINDOWS:
    a = res_df[(res_df["features_file"]=="features v1.2.xlsx") & (res_df["window"]==w)]
    b = res_df[(res_df["features_file"]=="features v1.5.xlsx") & (res_df["window"]==w)]
    if not a.empty and not b.empty:
        pairs.append({
            "window": w,
            "delta_cross_validated_r_squared": float(b["cross_validated_r_squared"].iloc[0] - a["cross_validated_r_squared"].iloc[0]),
            "delta_adjusted_r_squared": float(b["adjusted_r_squared"].iloc[0] - a["adjusted_r_squared"].iloc[0]),
            "delta_r_squared": float(b["r_squared"].iloc[0] - a["r_squared"].iloc[0]),
            "rows_used_v1.2": int(a["rows_used"].iloc[0]),
            "rows_used_v1.5": int(b["rows_used"].iloc[0]),
            "features_used_v1.2": int(a["features_used"].iloc[0]),
            "features_used_v1.5": int(b["features_used"].iloc[0]),
        })
if pairs:
    deltas = pd.DataFrame(pairs)
    print("\nDeltas (v1.5 minus v1.2) — positive is good:")
    display(deltas)

# Save CSVs next to your data
out_dir = find_file(EVENT_FILE).parent
res_df.to_csv(out_dir / "v1.2_vs_v1.5_results.csv", index=False)
wide.to_csv(out_dir / "v1.2_vs_v1.5_comparison_table.csv")
print(f"\nSaved to: {out_dir}")
print(" - v1.2_vs_v1.5_results.csv")
print(" - v1.2_vs_v1.5_comparison_table.csv")
Testing files: ['features v1.2.xlsx', 'features v1.5.xlsx']

Merge audit:
features_file features_sheet event_sheet window day0_features_col ticker_features_col day0_event_col ticker_event_col merged_rows predictor_cols target_col
0 features v1.2.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 8 CAR
1 features v1.2.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 8 CAR
2 features v1.2.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 8 CAR
3 features v1.5.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 1 CAR
4 features v1.5.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 1 CAR
5 features v1.5.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 1 CAR
Results (v1.2 vs v1.5):
rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared features_file features_sheet window
0 129 8 0.245160 0.194838 0.068034 features v1.2.xlsx features 0,1
1 129 1 0.082047 0.074819 -0.060921 features v1.5.xlsx features 0,1
2 129 8 0.201481 0.148246 0.094267 features v1.2.xlsx features 0,3
3 129 1 0.056217 0.048786 -0.051390 features v1.5.xlsx features 0,3
4 129 8 0.214735 0.162384 0.121771 features v1.2.xlsx features 0,5
5 129 1 0.059075 0.051666 -0.034691 features v1.5.xlsx features 0,5
Comparison table (rows = windows | columns = metrics per file):
adjusted_r_squared cross_validated_r_squared r_squared
features_file features v1.2.xlsx features v1.5.xlsx features v1.2.xlsx features v1.5.xlsx features v1.2.xlsx features v1.5.xlsx
window
0,1 0.194838 0.074819 0.068034 -0.060921 0.245160 0.082047
0,3 0.148246 0.048786 0.094267 -0.051390 0.201481 0.056217
0,5 0.162384 0.051666 0.121771 -0.034691 0.214735 0.059075
Deltas (v1.5 minus v1.2) — positive is good:
window delta_cross_validated_r_squared delta_adjusted_r_squared delta_r_squared rows_used_v1.2 rows_used_v1.5 features_used_v1.2 features_used_v1.5
0 0,1 -0.128955 -0.120019 -0.163114 129 129 8 1
1 0,3 -0.145657 -0.099461 -0.145264 129 129 8 1
2 0,5 -0.156462 -0.110718 -0.155660 129 129 8 1
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v1.2_vs_v1.5_results.csv
 - v1.2_vs_v1.5_comparison_table.csv
In [25]:
# === Compare features v1.2 vs v1.5 vs v1.6 (join on day0 + ticker) ===
# Metrics: R^2, Adjusted R^2, Grouped Cross-Validated R^2
# If needed first:  pip install pandas numpy scikit-learn openpyxl

from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold

# ---------- CONFIG ----------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
    Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["features v1.2.xlsx", "features v1.5.xlsx", "features v1.6.xlsx"]
WINDOWS = ["0,1", "0,3", "0,5"]
MAX_GROUP_FOLDS = 5

# ---------- HELPERS ----------
def find_file(name: str):
    for b in BASE_DIRS:
        p = (b / name)
        if p.exists(): return p
    return None

def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands: return next(iter(book))
    def score(item):
        n, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_event_window_sheets(book: dict):
    out = {"0,1": None, "0,3": None, "0,5": None}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
    }
    for nm in book:
        if is_readme_sheet(nm): continue
        for w, pat in pats.items():
            if out[w] is None and pat.search(str(nm)): out[w] = nm
    return out

def find_day0_column(df: pd.DataFrame):
    cols = [str(c) for c in df.columns]
    strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    # pick most date-like
    best, best_nonnull = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > best_nonnull: best, best_nonnull = c, k
    return best

def find_ticker_column(df: pd.DataFrame):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score: best, score = c, sc
    return best

def normalize_day0(s: pd.Series): 
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series): 
    return s.astype(str).str.strip().str.upper()

def find_target_col(df: pd.DataFrame):
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
    return c2[0] if c2 else None

def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
    df = df_feat_raw.copy()
    df["__day0__"]   = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
    g = g.dropna(subset=["__day0__","__ticker__"])
    return g, num_cols

def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
    n_groups = int(pd.Series(groups).nunique())
    mdl = LinearRegression()
    if n_groups >= 2:
        n_splits = min(max_folds, n_groups)
        gkf = GroupKFold(n_splits=n_splits)
        scores = []
        for tr, te in gkf.split(X, y, groups=groups):
            mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
            y_pred = mdl.predict(X.iloc[te].values)
            y_true = y.iloc[te].values
            ss_res = np.sum((y_true - y_pred)**2)
            ss_tot = np.sum((y_true - np.mean(y_true))**2)
            scores.append(1 - ss_res/ss_tot if ss_tot > 0 else np.nan)
        return float(np.nanmean(scores))
    # fallback to KFold on rows
    n = len(X)
    if n < 3: return np.nan
    kf = KFold(n_splits=min(3, n), shuffle=True, random_state=42)
    scores = []
    for tr, te in kf.split(X, y):
        mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
        y_pred = mdl.predict(X.iloc[te].values)
        y_true = y.iloc[te].values
        ss_res = np.sum((y_true - y_pred)**2)
        ss_tot = np.sum((y_true - np.mean(y_true))**2)
        scores.append(1 - ss_res/ss_tot if ss_tot > 0 else np.nan)
    return float(np.nanmean(scores))

def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
    data = pd.concat([y, X], axis=1).dropna()
    y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
    n, p = len(y_c), X_c.shape[1]
    if p == 0 or n < max(10, p+2):
        return dict(rows_used=int(n), features_used=int(p),
                    r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
    mdl = LinearRegression().fit(X_c.values, y_c.values)
    r2 = float(mdl.score(X_c.values, y_c.values))
    adj = 1 - (1 - r2)*(n - 1)/(n - p - 1) if (n - p - 1) > 0 else np.nan
    cv = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
    return dict(rows_used=int(n), features_used=int(p),
                r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)

# ---------- LOAD EVENT ----------
evt_path = find_file(EVENT_FILE)
if evt_path is None:
    raise FileNotFoundError("event_study.xlsx not found.")
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)

# ---------- RUN ----------
present = [f for f in FEATURE_FILES if find_file(f) is not None]
assert present, "Could not find any of: features v1.2.xlsx, v1.5.xlsx, v1.6.xlsx"

print("Testing files:", present)

merge_audit, results = [], []
for fname in present:
    fpath = find_file(fname)
    feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
    fsheet = choose_features_sheet(feat_book)
    df_feat_raw = feat_book[fsheet].copy()

    dfeat = find_day0_column(df_feat_raw)
    tfeat = find_ticker_column(df_feat_raw)
    feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)

    for w in WINDOWS:
        esheet = win_map.get(w)
        if esheet is None: 
            print(f"Missing event sheet for window {w}. Skipping.")
            continue

        df_evt = evt_book[esheet].copy()
        devt = find_day0_column(df_evt)
        tevt = find_ticker_column(df_evt)
        ycol = find_target_col(df_evt)

        evt = df_evt.copy()
        evt["__day0__"]   = normalize_day0(evt[devt])
        evt["__ticker__"] = normalize_ticker(evt[tevt])
        evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])

        merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
        groups = merged["__ticker__"]
        X = build_X(merged, num_cols, ycol)
        y = merged[ycol].astype(float)

        merge_audit.append({
            "features_file": fname, "features_sheet": fsheet, "event_sheet": esheet, "window": w,
            "day0_features_col": dfeat, "ticker_features_col": tfeat,
            "day0_event_col": devt, "ticker_event_col": tevt,
            "merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
        })

        m = fit_and_score(X, y, groups)
        m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
        results.append(m)

# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)

print("\nMerge audit:")
display(pd.DataFrame(merge_audit))

res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1.2 vs v1.5 vs v1.6):")
display(res_df)

print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
                          columns="features_file",
                          values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
                          aggfunc="first")
display(wide)

# ---------- DELTAS vs baseline v1.2 ----------
pairs = []
for w in WINDOWS:
    base = res_df[(res_df["features_file"]=="features v1.2.xlsx") & (res_df["window"]==w)]
    for alt in ["features v1.5.xlsx", "features v1.6.xlsx"]:
        comp = res_df[(res_df["features_file"]==alt) & (res_df["window"]==w)]
        if not base.empty and not comp.empty:
            pairs.append({
                "window": w,
                "model_vs_v1.2": alt,
                "delta_cross_validated_r_squared": float(comp["cross_validated_r_squared"].iloc[0] - base["cross_validated_r_squared"].iloc[0]),
                "delta_adjusted_r_squared": float(comp["adjusted_r_squared"].iloc[0] - base["adjusted_r_squared"].iloc[0]),
                "delta_r_squared": float(comp["r_squared"].iloc[0] - base["r_squared"].iloc[0]),
                "rows_used_base": int(base["rows_used"].iloc[0]),
                "rows_used_alt": int(comp["rows_used"].iloc[0]),
                "features_used_base": int(base["features_used"].iloc[0]),
                "features_used_alt": int(comp["features_used"].iloc[0]),
            })
if pairs:
    deltas = pd.DataFrame(pairs).sort_values(["window","model_vs_v1.2"]).reset_index(drop=True)
    print("\nDeltas vs v1.2 — positive is good:")
    display(deltas)

# ---------- SAVE ----------
out_dir = find_file(EVENT_FILE).parent
res_df.to_csv(out_dir / "v1.2_v1.5_v1.6_results.csv", index=False)
wide.to_csv(out_dir / "v1.2_v1.5_v1.6_comparison_table.csv")
if pairs:
    deltas.to_csv(out_dir / "v1.2_v1.5_v1.6_deltas_vs_v12.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v1.2_v1.5_v1.6_results.csv")
print(" - v1.2_v1.5_v1.6_comparison_table.csv")
print(" - v1.2_v1.5_v1.6_deltas_vs_v12.csv")
Testing files: ['features v1.2.xlsx', 'features v1.5.xlsx', 'features v1.6.xlsx']

Merge audit:
features_file features_sheet event_sheet window day0_features_col ticker_features_col day0_event_col ticker_event_col merged_rows predictor_cols target_col
0 features v1.2.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 8 CAR
1 features v1.2.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 8 CAR
2 features v1.2.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 8 CAR
3 features v1.5.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 1 CAR
4 features v1.5.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 1 CAR
5 features v1.5.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 1 CAR
6 features v1.6.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 7 CAR
7 features v1.6.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 7 CAR
8 features v1.6.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 7 CAR
Results (v1.2 vs v1.5 vs v1.6):
rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared features_file features_sheet window
0 129 8 0.245160 0.194838 0.068034 features v1.2.xlsx features 0,1
1 129 1 0.082047 0.074819 -0.060921 features v1.5.xlsx features 0,1
2 129 7 0.171406 0.123471 0.029867 features v1.6.xlsx features 0,1
3 129 8 0.201481 0.148246 0.094267 features v1.2.xlsx features 0,3
4 129 1 0.056217 0.048786 -0.051390 features v1.5.xlsx features 0,3
5 129 7 0.149466 0.100261 0.054352 features v1.6.xlsx features 0,3
6 129 8 0.214735 0.162384 0.121771 features v1.2.xlsx features 0,5
7 129 1 0.059075 0.051666 -0.034691 features v1.5.xlsx features 0,5
8 129 7 0.161872 0.113385 0.075321 features v1.6.xlsx features 0,5
Comparison table (rows = windows | columns = metrics per file):
adjusted_r_squared cross_validated_r_squared r_squared
features_file features v1.2.xlsx features v1.5.xlsx features v1.6.xlsx features v1.2.xlsx features v1.5.xlsx features v1.6.xlsx features v1.2.xlsx features v1.5.xlsx features v1.6.xlsx
window
0,1 0.194838 0.074819 0.123471 0.068034 -0.060921 0.029867 0.245160 0.082047 0.171406
0,3 0.148246 0.048786 0.100261 0.094267 -0.051390 0.054352 0.201481 0.056217 0.149466
0,5 0.162384 0.051666 0.113385 0.121771 -0.034691 0.075321 0.214735 0.059075 0.161872
Deltas vs v1.2 — positive is good:
window model_vs_v1.2 delta_cross_validated_r_squared delta_adjusted_r_squared delta_r_squared rows_used_base rows_used_alt features_used_base features_used_alt
0 0,1 features v1.5.xlsx -0.128955 -0.120019 -0.163114 129 129 8 1
1 0,1 features v1.6.xlsx -0.038166 -0.071367 -0.073754 129 129 8 7
2 0,3 features v1.5.xlsx -0.145657 -0.099461 -0.145264 129 129 8 1
3 0,3 features v1.6.xlsx -0.039915 -0.047985 -0.052015 129 129 8 7
4 0,5 features v1.5.xlsx -0.156462 -0.110718 -0.155660 129 129 8 1
5 0,5 features v1.6.xlsx -0.046450 -0.048998 -0.052863 129 129 8 7
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v1.2_v1.5_v1.6_results.csv
 - v1.2_v1.5_v1.6_comparison_table.csv
 - v1.2_v1.5_v1.6_deltas_vs_v12.csv
In [27]:
# === Grow v1.2 by adding features from v3, while ALWAYS including EPS surprise pct ===
# Outputs:
#   - summary (baseline vs improved) per window
#   - quick one-at-a-time gains for all v3 candidates (vs baseline)
#   - features selected by nested forward selection (with fold frequencies)
#   - CSVs saved next to event_study.xlsx
#
# If needed first: pip install pandas numpy scikit-learn openpyxl

from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold

# ---------- CONFIG ----------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
    Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
BASE_FILE  = "features v1.2.xlsx"   # baseline feature set
EPS_FILE   = "features v1.5.xlsx"   # has EPS surprise pct (used if base lacks it)
POOL_FILE  = "features v3.xlsx"     # candidate features to try adding

WINDOWS = ["0,1", "0,3", "0,5"]

MAX_OUTER_FOLDS = 5      # grouped by ticker
MAX_INNER_FOLDS = 3      # grouped by ticker on the training fold
MAX_FEATURES_TO_ADD = 5  # cap number of added v3 features
MIN_GAIN = 0.01          # require at least +0.01 CV R^2 (by ticker) to add

# ---------- HELPERS ----------
def find_file(name: str):
    for b in BASE_DIRS:
        p = b / name
        if p.exists(): return p
    raise FileNotFoundError(f"Could not find: {name}")

def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands: return next(iter(book))
    def score(item):
        n, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_event_window_sheets(book: dict):
    m = {"0,1": None, "0,3": None, "0,5": None}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
    }
    for name in book.keys():
        if is_readme_sheet(name): continue
        for w, pat in pats.items():
            if m[w] is None and pat.search(str(name)): m[w] = name
    return m

def find_day0_column(df: pd.DataFrame):
    cols = [str(c) for c in df.columns]
    strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    # most date-like
    best, kbest = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > kbest: best, kbest = c, k
    return best

def find_ticker_column(df: pd.DataFrame):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score: best, score = c, sc
    return best

def normalize_day0(s: pd.Series) -> pd.Series:
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series) -> pd.Series:
    return s.astype(str).str.strip().str.upper()

def find_target_col(df: pd.DataFrame):
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
    return c2[0] if c2 else None

def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
    df = df_feat_raw.copy()
    df["__day0__"]   = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
    g = g.dropna(subset=["__day0__","__ticker__"])
    return g, num_cols

def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def safe_group_cv_scores(X, y, groups, max_splits=5, min_splits=2):
    """Return mean test R^2 and the splitter used. Uses GroupKFold when possible; KFold fallback."""
    n_groups = int(pd.Series(groups).nunique())
    if n_groups >= min_splits:
        n_splits = min(max_splits, n_groups)
        splitter = GroupKFold(n_splits=n_splits)
        model = LinearRegression()
        scores = []
        for tr, te in splitter.split(X, y, groups=groups):
            model.fit(X.iloc[tr].values, y.iloc[tr].values)
            y_hat = model.predict(X.iloc[te].values)
            y_true = y.iloc[te].values
            ss_res = np.sum((y_true - y_hat)**2)
            ss_tot = np.sum((y_true - np.mean(y_true))**2)
            scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
        return float(np.nanmean(scores)), splitter
    # fallback to KFold
    n = len(X)
    if n < 3: return np.nan, None
    splitter = KFold(n_splits=min(3, n), shuffle=True, random_state=42)
    model = LinearRegression()
    scores = []
    for tr, te in splitter.split(X, y):
        model.fit(X.iloc[tr].values, y.iloc[tr].values)
        y_hat = model.predict(X.iloc[te].values)
        y_true = y.iloc[te].values
        ss_res = np.sum((y_true - y_hat)**2); ss_tot = np.sum((y_true - np.mean(y_true))**2)
        scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
    return float(np.nanmean(scores)), splitter

def in_sample_and_adjusted(X: pd.DataFrame, y: pd.Series):
    if X.shape[1] == 0: return np.nan, np.nan
    mdl = LinearRegression().fit(X.values, y.values)
    r2 = float(mdl.score(X.values, y.values))
    n, p = len(y), X.shape[1]
    adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
    return r2, adj

def find_eps_column(df: pd.DataFrame):
    # Look for something like "EPS surprise pct", case-insensitive, flexible wording
    pats = [
        r"eps.*surpris.*(pct|percent|%)",
        r"earnings.*surpris.*(pct|percent|%)",
        r"eps[_\s]*surpris",  # fallback
        r"surpris[_\s]*(pct|percent|%)"
    ]
    nums = df.select_dtypes(include=[np.number]).columns
    for pat in pats:
        cands = [c for c in df.columns if re.search(pat, str(c), flags=re.IGNORECASE)]
        cands = [c for c in cands if c in nums]
        if cands:
            return cands[0]
    return None

# ---------- LOAD BOOKS ----------
evt_path  = find_file(EVENT_FILE)
evt_book  = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map   = find_event_window_sheets(evt_book)

base_path = find_file(BASE_FILE)
base_book = pd.read_excel(base_path, sheet_name=None, engine="openpyxl")
base_sheet = choose_features_sheet(base_book)
base_raw   = base_book[base_sheet].copy()
base_day0  = find_day0_column(base_raw)
base_tick  = find_ticker_column(base_raw)
base_grp, base_num_cols = aggregate_features(base_raw, base_day0, base_tick)

pool_path = find_file(POOL_FILE)
pool_book = pd.read_excel(pool_path, sheet_name=None, engine="openpyxl")
pool_sheet = choose_features_sheet(pool_book)
pool_raw   = pool_book[pool_sheet].copy()
pool_day0  = find_day0_column(pool_raw)
pool_tick  = find_ticker_column(pool_raw)
pool_grp, pool_num_cols = aggregate_features(pool_raw, pool_day0, pool_tick)

# EPS column: try base first, else v1.5, else v3/pool
eps_col = find_eps_column(base_grp)
if eps_col is None:
    try:
        eps_path = find_file(EPS_FILE)
        eps_book = pd.read_excel(eps_path, sheet_name=None, engine="openpyxl")
        eps_sheet = choose_features_sheet(eps_book)
        eps_raw = eps_book[eps_sheet].copy()
        eps_day0 = find_day0_column(eps_raw); eps_tick = find_ticker_column(eps_raw)
        eps_grp, _ = aggregate_features(eps_raw, eps_day0, eps_tick)
        eps_col = find_eps_column(eps_grp)
        if eps_col is None:
            raise ValueError("Could not find EPS surprise pct in v1.5.")
    except Exception:
        eps_col = find_eps_column(pool_grp)
        if eps_col is None:
            raise ValueError("Could not find an EPS surprise pct column in base, v1.5, or v3.")
        eps_grp = pool_grp[["__day0__","__ticker__", eps_col]].copy()
else:
    eps_grp = base_grp[["__day0__","__ticker__", eps_col]].copy()

# Candidate features from v3 (exclude anything already in base or the EPS column)
candidate_cols = [c for c in pool_num_cols if c not in set(base_num_cols) and c != eps_col]

# ---------- WORK PER WINDOW ----------
all_quick = []
all_selected = []
all_summary = []

for window in WINDOWS:
    esheet = win_map.get(window)
    if esheet is None:
        print(f"Skip window {window}: event sheet not found.")
        continue

    df_evt = evt_book[esheet].copy()
    evt_day0 = find_day0_column(df_evt); evt_tick = find_ticker_column(df_evt); y_col = find_target_col(df_evt)

    evt = df_evt.copy()
    evt["__day0__"]   = normalize_day0(evt[evt_day0])
    evt["__ticker__"] = normalize_ticker(evt[evt_tick])
    evt = evt.dropna(subset=["__day0__","__ticker__", y_col]).drop_duplicates(subset=["__day0__","__ticker__"])

    # Merge base, EPS, and pool keys
    merged_base = base_grp.merge(eps_grp[["__day0__","__ticker__", eps_col]], on=["__day0__","__ticker__"], how="left")
    merged_base = merged_base.merge(evt[["__day0__","__ticker__", y_col]], on=["__day0__","__ticker__"], how="inner")
    merged_pool = pool_grp[["__day0__","__ticker__"] + candidate_cols]
    merged = merged_base.merge(merged_pool, on=["__day0__","__ticker__"], how="left")

    # Build BASE = v1.2 features + EPS column
    base_plus_eps_cols = list(dict.fromkeys(base_num_cols + [eps_col]))  # keep order, drop dup
    X_base = build_X(merged, base_plus_eps_cols, y_col)
    y = merged[y_col].astype(float)
    groups = merged["__ticker__"]

    # Baseline scores
    base_cv, outer_splitter = safe_group_cv_scores(X_base, y, groups, max_splits=MAX_OUTER_FOLDS, min_splits=2)
    base_r2, base_adj = in_sample_and_adjusted(X_base, y)

    # ---- QUICK ONE-AT-A-TIME GAINS (add each v3 candidate to base+EPS) ----
    quick_rows = []
    for c in candidate_cols:
        if c not in merged.columns: continue
        Xt = pd.concat([X_base, merged[[c]]], axis=1)
        data = pd.concat([y, Xt], axis=1).dropna()
        y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
        if X_c.shape[1] == 0 or len(y_c) < 10: continue
        cv_r2, _ = safe_group_cv_scores(X_c, y_c, groups.loc[X_c.index], max_splits=MAX_OUTER_FOLDS, min_splits=2)
        quick_rows.append({"window": window, "feature": c, "cv_with_feature": cv_r2, "delta": cv_r2 - base_cv})
    quick_df = pd.DataFrame(quick_rows).sort_values(["window","delta"], ascending=[True, False]).reset_index(drop=True)
    all_quick.append(quick_df)

    # ---- NESTED FORWARD SELECTION (start from base+EPS, add v3 features if they help) ----
    # Build outer splits
    if outer_splitter is None:
        splits = []
    elif isinstance(outer_splitter, GroupKFold):
        splits = list(outer_splitter.split(X_base, y, groups=groups))
    else:
        splits = list(outer_splitter.split(X_base, y))

    outer_scores = []
    fold_selected = []

    for tr, te in splits:
        Xb_tr, Xb_te = X_base.iloc[tr], X_base.iloc[te]
        y_tr, y_te = y.iloc[tr], y.iloc[te]
        groups_tr = groups.iloc[tr]

        def inner_cv_score(Xt, yt):
            return safe_group_cv_scores(Xt, yt, groups_tr.loc[Xt.index], max_splits=MAX_INNER_FOLDS, min_splits=2)[0]

        # starting point = base+EPS on training data
        data_tr = pd.concat([y_tr, Xb_tr], axis=1).dropna()
        y_tr_c, X_tr_c = data_tr.iloc[:,0], data_tr.iloc[:,1:]
        base_inner = inner_cv_score(X_tr_c, y_tr_c)

        avail = [c for c in candidate_cols if c in merged.columns]
        chosen = []

        for _ in range(MAX_FEATURES_TO_ADD):
            best_gain, best_feat = 0.0, None
            for c in avail:
                col = merged.loc[Xb_tr.index, c]
                Xt = pd.concat([X_tr_c, col], axis=1).dropna()
                yt = y_tr.loc[Xt.index]
                if Xt.shape[1] == 0 or len(yt) < 10: 
                    continue
                score = inner_cv_score(Xt, yt)
                gain = score - base_inner
                if gain > best_gain:
                    best_gain, best_feat = gain, c
            if best_feat is None or best_gain < MIN_GAIN:
                break
            # accept the feature
            chosen.append(best_feat)
            avail.remove(best_feat)
            X_tr_c = pd.concat([X_tr_c, merged.loc[Xb_tr.index, [best_feat]]], axis=1).dropna()
            y_tr_c = y_tr.loc[X_tr_c.index]
            base_inner = inner_cv_score(X_tr_c, y_tr_c)

        fold_selected.append(chosen)

        # evaluate on the outer test fold
        X_te = Xb_te.copy()
        if chosen:
            X_te = pd.concat([X_te, merged.loc[Xb_te.index, chosen]], axis=1)
        data_te = pd.concat([y_te, X_te], axis=1).dropna()
        y_te_c, X_te_c = data_te.iloc[:,0], data_te.iloc[:,1:]
        if X_te_c.shape[1] == 0 or len(y_te_c) < 2:
            outer_scores.append(np.nan)
        else:
            # use training-fitted model on the final training matrix
            mdl = LinearRegression().fit(X_tr_c.values, y_tr_c.values)
            y_hat = mdl.predict(X_te_c.values)
            ss_res = np.sum((y_te_c.values - y_hat)**2)
            ss_tot = np.sum((y_te_c.values - np.mean(y_te_c.values))**2)
            outer_scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)

    # count selections across folds
    flat = [f for sub in fold_selected for f in sub]
    freq = pd.Series(flat).value_counts().rename("selected_in_folds").to_frame()
    freq["window"] = window
    freq = freq.reset_index().rename(columns={"index":"feature"})
    all_selected.append(freq)

    # union of features picked in at least half the folds
    keep_union = []
    if not freq.empty and len(splits) > 0:
        half = max(1, int(np.ceil(len(splits)/2)))
        keep_union = freq.loc[freq["selected_in_folds"] >= half, "feature"].tolist()

    # evaluate union model (base+EPS + union additions) on full sample CV
    X_full = X_base.copy()
    if keep_union:
        X_full = pd.concat([X_full, merged[keep_union]], axis=1)
    data_full = pd.concat([y, X_full], axis=1).dropna()
    y_full, X_full_c = data_full.iloc[:,0], data_full.iloc[:,1:]
    full_cv, _ = safe_group_cv_scores(X_full_c, y_full, groups.loc[X_full_c.index], max_splits=MAX_OUTER_FOLDS, min_splits=2)
    full_r2, full_adj = in_sample_and_adjusted(X_full_c, y_full)

    all_summary.append({
        "window": window,
        "baseline_cross_validated_r_squared": base_cv,
        "baseline_r_squared": base_r2,
        "baseline_adjusted_r_squared": base_adj,
        "selected_union_features": ", ".join(keep_union) if keep_union else "",
        "n_selected_union": len(keep_union),
        "union_model_cross_validated_r_squared": full_cv,
        "union_model_r_squared": full_r2,
        "union_model_adjusted_r_squared": full_adj,
        "nested_forward_mean_test_cross_validated_r_squared": float(np.nanmean(outer_scores)) if outer_scores else np.nan
    })

# ---------- REPORT ----------
quick_all   = pd.concat(all_quick, ignore_index=True) if all_quick else pd.DataFrame()
selected_all= pd.concat(all_selected, ignore_index=True) if all_selected else pd.DataFrame()
summary     = pd.DataFrame(all_summary)

pd.set_option("display.max_rows", 200)
pd.set_option("display.max_columns", None)

print("\n=== Summary: baseline (v1.2 + EPS) vs improved (added v3 features) ===")
display(summary.sort_values("window"))

if not quick_all.empty:
    print("\n=== Quick marginal gains (top 25 per window) — delta vs baseline CV R^2 ===")
    display(quick_all.sort_values(["window","delta"], ascending=[True, False]).groupby("window").head(25))

if not selected_all.empty:
    print("\n=== Features selected by nested forward selection (freq across outer folds) ===")
    display(selected_all.sort_values(["window","selected_in_folds"], ascending=[True, False]))

# ---------- SAVE ----------
out_dir = evt_path.parent
summary.to_csv(out_dir / "v12_plus_EPS_growth_summary.csv", index=False)
if not quick_all.empty:
    quick_all.to_csv(out_dir / "v12_plus_EPS_quick_gains.csv", index=False)
if not selected_all.empty:
    selected_all.to_csv(out_dir / "v12_plus_EPS_selected_freq.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v12_plus_EPS_growth_summary.csv")
print(" - v12_plus_EPS_quick_gains.csv")
print(" - v12_plus_EPS_selected_freq.csv")
=== Summary: baseline (v1.2 + EPS) vs improved (added v3 features) ===
window baseline_cross_validated_r_squared baseline_r_squared baseline_adjusted_r_squared selected_union_features n_selected_union union_model_cross_validated_r_squared union_model_r_squared union_model_adjusted_r_squared nested_forward_mean_test_cross_validated_r_squared
0 0,1 0.029867 0.171406 0.123471 pre_vol_10d 1 0.032993 0.184116 0.129723 -0.221901
1 0,3 0.054352 0.149466 0.100261 pre_vol_10d 1 0.053016 0.158257 0.102140 -0.207395
2 0,5 0.075321 0.161872 0.113385 pre_vol_10d 1 0.075510 0.166639 0.111082 -0.199900
=== Quick marginal gains (top 25 per window) — delta vs baseline CV R^2 ===
window feature cv_with_feature delta
0 0,1 cpi_x_prevol5d 0.043880 1.401278e-02
1 0,1 pre_vol_10d 0.032993 3.125374e-03
2 0,1 macro_fedfunds 0.032806 2.938385e-03
3 0,1 is_friday 0.029867 -5.932754e-16
4 0,1 is_amc 0.029867 -5.932754e-16
5 0,1 is_bmo 0.029867 -5.932754e-16
6 0,1 high_vix_regime 0.024420 -5.447331e-03
7 0,1 pre_ret_10d 0.023300 -6.567111e-03
8 0,1 rates_x_surprise 0.021839 -8.028273e-03
9 0,1 is_monday 0.021704 -8.162976e-03
10 0,1 is_january 0.021129 -8.738139e-03
11 0,1 pre_vol_3d 0.017442 -1.242591e-02
12 0,1 macro_cpi_yoy 0.013081 -1.678638e-02
13 0,1 high_density_week 0.008183 -2.168488e-02
14 0,1 weekly_density 0.008183 -2.168488e-02
15 0,1 vix_chg_10d_lag1 0.005445 -2.442203e-02
16 0,1 mkt_ret_10d_lag1 0.002847 -2.702007e-02
17 0,1 mkt_ret_1d_lag1 0.001018 -2.884926e-02
18 0,1 high_rates_regime -0.000556 -3.042343e-02
19 0,1 cpi_x_surprise -0.002427 -3.229439e-02
20 0,1 month -0.015157 -4.502429e-02
21 0,1 quarter -0.021307 -5.117470e-02
22 0,1 vix_x_prevol5d -0.037740 -6.760706e-02
23 0,1 vix_x_surprise -0.041374 -7.124191e-02
24 0,1 day_of_week -0.042865 -7.273295e-02
36 0,3 is_monday 0.060106 5.753802e-03
37 0,3 is_friday 0.054352 -9.228729e-16
38 0,3 is_amc 0.054352 -9.228729e-16
39 0,3 is_bmo 0.054352 -9.228729e-16
40 0,3 pre_vol_10d 0.053016 -1.336237e-03
41 0,3 macro_fedfunds 0.052784 -1.568323e-03
42 0,3 cpi_x_prevol5d 0.051568 -2.784127e-03
43 0,3 is_january 0.046472 -7.880373e-03
44 0,3 pre_vol_3d 0.045408 -8.943739e-03
45 0,3 high_vix_regime 0.044154 -1.019809e-02
46 0,3 rates_x_surprise 0.040254 -1.409774e-02
47 0,3 high_rates_regime 0.032729 -2.162309e-02
48 0,3 macro_cpi_yoy 0.031711 -2.264069e-02
49 0,3 pre_ret_10d 0.030065 -2.428709e-02
50 0,3 cpi_x_surprise 0.025314 -2.903777e-02
51 0,3 mkt_ret_1d_lag1 0.023783 -3.056916e-02
52 0,3 high_density_week 0.011312 -4.304005e-02
53 0,3 weekly_density 0.011312 -4.304005e-02
54 0,3 day_of_week 0.009372 -4.498044e-02
55 0,3 month 0.008169 -4.618267e-02
56 0,3 baa_minus_aaa_bp 0.002811 -5.154114e-02
57 0,3 baa_minus_aaa_pct 0.002811 -5.154114e-02
58 0,3 mkt_ret_10d_lag1 0.002333 -5.201875e-02
59 0,3 quarter -0.001250 -5.560216e-02
60 0,3 investment_grade_option_adjusted_spread_bp -0.006374 -6.072564e-02
72 0,5 cpi_x_prevol5d 0.077828 2.507406e-03
73 0,5 is_monday 0.077687 2.365892e-03
74 0,5 pre_vol_10d 0.075510 1.890622e-04
75 0,5 is_friday 0.075321 -5.551115e-16
76 0,5 is_amc 0.075321 -5.551115e-16
77 0,5 is_bmo 0.075321 -5.551115e-16
78 0,5 macro_fedfunds 0.072634 -2.687242e-03
79 0,5 high_vix_regime 0.070251 -5.070044e-03
80 0,5 is_january 0.067968 -7.352718e-03
81 0,5 pre_vol_3d 0.067084 -8.236664e-03
82 0,5 high_rates_regime 0.064595 -1.072590e-02
83 0,5 macro_cpi_yoy 0.056270 -1.905092e-02
84 0,5 pre_ret_10d 0.054410 -2.091083e-02
85 0,5 mkt_ret_1d_lag1 0.051872 -2.344909e-02
86 0,5 high_density_week 0.047707 -2.761439e-02
87 0,5 weekly_density 0.047707 -2.761439e-02
88 0,5 rates_x_surprise 0.042411 -3.291045e-02
89 0,5 month 0.037916 -3.740507e-02
90 0,5 cpi_x_surprise 0.033743 -4.157764e-02
91 0,5 baa_minus_aaa_bp 0.033658 -4.166307e-02
92 0,5 baa_minus_aaa_pct 0.033658 -4.166307e-02
93 0,5 day_of_week 0.028835 -4.648578e-02
94 0,5 quarter 0.027851 -4.746982e-02
95 0,5 vix_chg_10d_lag1 0.027678 -4.764274e-02
96 0,5 mkt_ret_10d_lag1 0.024830 -5.049113e-02
=== Features selected by nested forward selection (freq across outer folds) ===
feature selected_in_folds window
0 pre_vol_10d 2 0,1
1 is_q4 1 0,1
2 is_january 1 0,1
3 month 1 0,1
4 baa_minus_aaa_pct 1 0,1
5 day_of_week 1 0,1
6 cpi_x_prevol5d 1 0,1
7 weekly_density 1 0,1
8 vix_x_surprise 1 0,1
9 rates_x_surprise 1 0,1
10 pre_vol_10d 2 0,3
11 is_q4 1 0,3
12 is_january 1 0,3
13 day_of_week 1 0,3
14 vix_x_surprise 1 0,3
15 weekly_density 1 0,3
16 cpi_x_surprise 1 0,3
17 pre_vol_10d 2 0,5
18 month 1 0,5
19 is_january 1 0,5
20 baa_minus_aaa_bp 1 0,5
21 day_of_week 1 0,5
22 rates_x_surprise 1 0,5
23 cpi_x_prevol5d 1 0,5
24 high_density_week 1 0,5
25 vix_x_surprise 1 0,5
26 high_vix_regime 1 0,5
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v12_plus_EPS_growth_summary.csv
 - v12_plus_EPS_quick_gains.csv
 - v12_plus_EPS_selected_freq.csv
In [31]:
# === Compare features v1.2 vs features v1.5 (join on day0 + ticker) ===
# Metrics: coefficient of determination, adjusted coefficient of determination,
#          cross validated coefficient of determination (ticker-aware)
# If needed first:  pip install pandas numpy scikit-learn openpyxl

from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold

# ---------- CONFIG ----------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
    Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["features v1.2.xlsx", "features v1.5.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5

# ---------- HELPERS ----------
def find_file(name: str):
    for b in BASE_DIRS:
        p = b / name
        if p.exists():
            return p
    return None

def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands:
        return next(iter(book))
    def score(item):
        n, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_event_window_sheets(book: dict):
    m = {"0,1": None, "0,3": None, "0,5": None}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
    }
    for name in book.keys():
        if is_readme_sheet(name): 
            continue
        for w, pat in pats.items():
            if m[w] is None and pat.search(str(name)):
                m[w] = name
    return m

def find_day0_column(df: pd.DataFrame):
    cols = [str(c) for c in df.columns]
    strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    best, best_nonnull = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > best_nonnull:
            best, best_nonnull = c, k
    return best

def find_ticker_column(df: pd.DataFrame):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score:
            best, score = c, sc
    return best

def normalize_day0(s: pd.Series) -> pd.Series:
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series) -> pd.Series:
    return s.astype(str).str.strip().str.upper()

def find_target_col(df: pd.DataFrame):
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
    return c2[0] if c2 else None

def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
    df = df_feat_raw.copy()
    df["__day0__"]   = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
    g = g.dropna(subset=["__day0__","__ticker__"])
    return g, num_cols

def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
    """Mean test coefficient of determination using group folds when possible; row folds fallback if too few groups."""
    n_groups = int(pd.Series(groups).nunique())
    if n_groups >= 2:
        n_splits = min(max_folds, n_groups)
        gkf = GroupKFold(n_splits=n_splits)
        scores = []
        mdl = LinearRegression()
        for tr, te in gkf.split(X, y, groups=groups):
            mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
            y_pred = mdl.predict(X.iloc[te].values)
            y_true = y.iloc[te].values
            ss_res = np.sum((y_true - y_pred)**2)
            ss_tot = np.sum((y_true - np.mean(y_true))**2)
            scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
        return float(np.nanmean(scores))
    # fallback to ordinary KFold on rows
    n = len(X)
    if n < 3:
        return np.nan
    n_splits = min(3, n)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    mdl = LinearRegression()
    for tr, te in kf.split(X, y):
        mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
        y_pred = mdl.predict(X.iloc[te].values)
        y_true = y.iloc[te].values
        ss_res = np.sum((y_true - y_pred)**2)
        ss_tot = np.sum((y_true - np.mean(y_true))**2)
        scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
    return float(np.nanmean(scores))

def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
    data = pd.concat([y, X], axis=1).dropna()
    y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
    n, p = len(y_c), X_c.shape[1]
    if p == 0 or n < max(10, p+2):
        return dict(rows_used=int(n), features_used=int(p),
                    r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
    mdl = LinearRegression().fit(X_c.values, y_c.values)
    r2 = float(mdl.score(X_c.values, y_c.values))
    adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
    cv = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
    return dict(rows_used=int(n), features_used=int(p),
                r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)

# ---------- LOAD EVENT ----------
evt_path = find_file(EVENT_FILE)
if evt_path is None:
    raise FileNotFoundError("event_study.xlsx not found in the configured folders.")
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map = find_event_window_sheets(evt_book)

# ---------- RUN ----------
present = [f for f in FEATURE_FILES if find_file(f) is not None]
assert present, "Could not find features v1.2.xlsx or features v1.5.xlsx."

print("Testing files:", present)

merge_audit = []
results = []

for fname in present:
    fpath = find_file(fname)
    feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
    fsheet = choose_features_sheet(feat_book)
    df_feat_raw = feat_book[fsheet].copy()

    dfeat = find_day0_column(df_feat_raw)
    tfeat = find_ticker_column(df_feat_raw)
    feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)

    for w in WINDOWS:
        esheet = win_map.get(w)
        if esheet is None:
            print(f"Missing event sheet for window {w}. Skipping.")
            continue

        df_evt = evt_book[esheet].copy()
        devt = find_day0_column(df_evt)
        tevt = find_ticker_column(df_evt)
        ycol = find_target_col(df_evt)

        evt = df_evt.copy()
        evt["__day0__"]   = normalize_day0(evt[devt])
        evt["__ticker__"] = normalize_ticker(evt[tevt])
        evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])

        merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
        groups = merged["__ticker__"]
        X = build_X(merged, num_cols, ycol)
        y = merged[ycol].astype(float)

        merge_audit.append({
            "features_file": fname, "features_sheet": fsheet, "event_sheet": esheet, "window": w,
            "day0_features_col": dfeat, "ticker_features_col": tfeat,
            "day0_event_col": devt, "ticker_event_col": tevt,
            "merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
        })

        m = fit_and_score(X, y, groups)
        m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
        results.append(m)

# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)

print("\nMerge audit:")
display(pd.DataFrame(merge_audit))

res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1.2 vs v1.5):")
display(res_df)

print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
                          columns="features_file",
                          values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
                          aggfunc="first")
display(wide)

# ---------- DELTAS (v1.5 minus v1.2) ----------
pairs = []
for w in WINDOWS:
    a = res_df[(res_df["features_file"]=="features v1.2.xlsx") & (res_df["window"]==w)]
    b = res_df[(res_df["features_file"]=="features v1.5.xlsx") & (res_df["window"]==w)]
    if not a.empty and not b.empty:
        pairs.append({
            "window": w,
            "delta_cross_validated_r_squared": float(b["cross_validated_r_squared"].iloc[0] - a["cross_validated_r_squared"].iloc[0]),
            "delta_adjusted_r_squared": float(b["adjusted_r_squared"].iloc[0] - a["adjusted_r_squared"].iloc[0]),
            "delta_r_squared": float(b["r_squared"].iloc[0] - a["r_squared"].iloc[0]),
            "rows_used_v1.2": int(a["rows_used"].iloc[0]),
            "rows_used_v1.5": int(b["rows_used"].iloc[0]),
            "features_used_v1.2": int(a["features_used"].iloc[0]),
            "features_used_v1.5": int(b["features_used"].iloc[0]),
        })
if pairs:
    deltas = pd.DataFrame(pairs)
    print("\nDeltas (v1.5 minus v1.2) — positive is good:")
    display(deltas)

# Save CSVs next to your data
out_dir = find_file(EVENT_FILE).parent
res_df.to_csv(out_dir / "v1.2_vs_v1.5_results.csv", index=False)
wide.to_csv(out_dir / "v1.2_vs_v1.5_comparison_table.csv")
print(f"\nSaved to: {out_dir}")
print(" - v1.2_vs_v1.5_results.csv")
print(" - v1.2_vs_v1.5_comparison_table.csv")
Testing files: ['features v1.2.xlsx', 'features v1.5.xlsx']

Merge audit:
features_file features_sheet event_sheet window day0_features_col ticker_features_col day0_event_col ticker_event_col merged_rows predictor_cols target_col
0 features v1.2.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 8 CAR
1 features v1.2.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 8 CAR
2 features v1.2.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 8 CAR
3 features v1.5.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 4 CAR
4 features v1.5.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 4 CAR
5 features v1.5.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 4 CAR
Results (v1.2 vs v1.5):
rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared features_file features_sheet window
0 129 8 0.245160 0.194838 0.068034 features v1.2.xlsx features 0,1
1 129 4 0.098411 0.069328 -0.093031 features v1.5.xlsx features 0,1
2 129 8 0.201481 0.148246 0.094267 features v1.2.xlsx features 0,3
3 129 4 0.067332 0.037246 -0.095339 features v1.5.xlsx features 0,3
4 129 8 0.214735 0.162384 0.121771 features v1.2.xlsx features 0,5
5 129 4 0.064792 0.034624 -0.095392 features v1.5.xlsx features 0,5
Comparison table (rows = windows | columns = metrics per file):
adjusted_r_squared cross_validated_r_squared r_squared
features_file features v1.2.xlsx features v1.5.xlsx features v1.2.xlsx features v1.5.xlsx features v1.2.xlsx features v1.5.xlsx
window
0,1 0.194838 0.069328 0.068034 -0.093031 0.245160 0.098411
0,3 0.148246 0.037246 0.094267 -0.095339 0.201481 0.067332
0,5 0.162384 0.034624 0.121771 -0.095392 0.214735 0.064792
Deltas (v1.5 minus v1.2) — positive is good:
window delta_cross_validated_r_squared delta_adjusted_r_squared delta_r_squared rows_used_v1.2 rows_used_v1.5 features_used_v1.2 features_used_v1.5
0 0,1 -0.161065 -0.12551 -0.146749 129 129 8 4
1 0,3 -0.189606 -0.11100 -0.134149 129 129 8 4
2 0,5 -0.217163 -0.12776 -0.149943 129 129 8 4
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v1.2_vs_v1.5_results.csv
 - v1.2_vs_v1.5_comparison_table.csv
In [33]:
# === Find the best add-on features from v3 for a v1.2 baseline ===
# - Join on day0 + ticker
# - Baseline = v1.2 features only
# - Rank each v3 feature by one-at-a-time cross-validated coefficient of determination gain (delta)
# - Then do greedy forward add: keep adding v3 features while cross-validated coefficient of determination improves
# - Saves: top single gains, greedy path, summary
#
# If needed first:  pip install pandas numpy scikit-learn openpyxl

from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold

# -------- CONFIG --------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
    Path("."), Path("/mnt/data"),
]
EVENT_FILE = "event_study.xlsx"
BASE_FILE  = "features v1.2.xlsx"   # your base 8
POOL_FILE  = "features v3.xlsx"     # extra candidates
WINDOWS    = ["0,1","0,3","0,5"]

MAX_GROUP_FOLDS = 5      # grouped by ticker
MAX_ADDS        = 8      # try adding up to this many features (change if you want)
MIN_GAIN        = 0.01   # require at least this improvement in cross-validated coefficient of determination to keep a feature

# -------- HELPERS --------
def find_file(name: str):
    for b in BASE_DIRS:
        p = b / name
        if p.exists():
            return p
    raise FileNotFoundError(f"Could not find {name}")

def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands:
        return next(iter(book))
    def score(item):
        n, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_event_window_sheets(book: dict):
    out = {"0,1": None, "0,3": None, "0,5": None}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
    }
    for nm in book:
        if is_readme_sheet(nm): 
            continue
        for w, pat in pats.items():
            if out[w] is None and pat.search(str(nm)):
                out[w] = nm
    return out

def find_day0_column(df: pd.DataFrame):
    cols = [str(c) for c in df.columns]
    strict = [c for c in cols if re.search(r"\bday[\s_]*0\b", c, flags=re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    best, kbest = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > kbest:
            best, kbest = c, k
    return best

def find_ticker_column(df: pd.DataFrame):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns:
            return c
    # guess
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score:
            best, score = c, sc
    return best

def normalize_day0(s: pd.Series):
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series):
    return s.astype(str).str.strip().str.upper()

def find_target_col(df: pd.DataFrame):
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
    return c2[0] if c2 else None

def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
    df = df_feat_raw.copy()
    df["__day0__"]   = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    g = (df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
           .dropna(subset=["__day0__","__ticker__"]))
    return g, num_cols

def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def safe_group_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_splits=5):
    """Mean test coefficient of determination. Use GroupKFold by ticker. Fallback to KFold if needed."""
    n_groups = int(pd.Series(groups).nunique())
    model = LinearRegression()
    scores = []
    if n_groups >= 2:
        splits = GroupKFold(n_splits=min(max_splits, n_groups)).split(X, y, groups=groups)
    else:
        # row-wise fallback
        n = len(X)
        if n < 3: 
            return np.nan
        splits = KFold(n_splits=min(3, n), shuffle=True, random_state=42).split(X, y)
    for tr, te in splits:
        model.fit(X.iloc[tr].values, y.iloc[tr].values)
        y_hat = model.predict(X.iloc[te].values)
        y_true = y.iloc[te].values
        ss_res = np.sum((y_true - y_hat)**2)
        ss_tot = np.sum((y_true - np.mean(y_true))**2)
        scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
    return float(np.nanmean(scores))

def in_sample_and_adjusted(X: pd.DataFrame, y: pd.Series):
    if X.shape[1] == 0:
        return np.nan, np.nan
    mdl = LinearRegression().fit(X.values, y.values)
    r2 = float(mdl.score(X.values, y.values))
    n, p = len(y), X.shape[1]
    adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
    return r2, adj

# -------- LOAD --------
evt_path = find_file(EVENT_FILE)
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map  = find_event_window_sheets(evt_book)

base_book = pd.read_excel(find_file(BASE_FILE), sheet_name=None, engine="openpyxl")
base_sheet = choose_features_sheet(base_book)
base_raw   = base_book[base_sheet].copy()
b_day0 = find_day0_column(base_raw); b_tic = find_ticker_column(base_raw)
base_grp, base_num_cols = aggregate_features(base_raw, b_day0, b_tic)

pool_book = pd.read_excel(find_file(POOL_FILE), sheet_name=None, engine="openpyxl")
pool_sheet = choose_features_sheet(pool_book)
pool_raw   = pool_book[pool_sheet].copy()
p_day0 = find_day0_column(pool_raw); p_tic = find_ticker_column(pool_raw)
pool_grp, pool_num_cols = aggregate_features(pool_raw, p_day0, p_tic)

# -------- WORK PER WINDOW --------
all_single = []
all_paths  = []
all_summary= []

for window in WINDOWS:
    esheet = win_map.get(window)
    if esheet is None:
        print(f"Skip window {window}: event sheet not found.")
        continue

    df_evt = evt_book[esheet].copy()
    e_day0 = find_day0_column(df_evt); e_tic = find_ticker_column(df_evt); y_col = find_target_col(df_evt)
    evt = df_evt.copy()
    evt["__day0__"]   = normalize_day0(evt[e_day0])
    evt["__ticker__"] = normalize_ticker(evt[e_tic])
    evt = (evt.dropna(subset=["__day0__","__ticker__", y_col])
              .drop_duplicates(subset=["__day0__","__ticker__"]))

    # Merge base + event
    merged_base = base_grp.merge(evt[["__day0__","__ticker__", y_col]], on=["__day0__","__ticker__"], how="inner")
    X_base_all = build_X(merged_base, base_num_cols, y_col)
    y = merged_base[y_col].astype(float)
    groups = merged_base["__ticker__"]

    # Remember which base columns actually survived cleaning
    base_used = list(X_base_all.columns)

    # Baseline scores
    base_cv, base_adj = safe_group_cv_r2(X_base_all, y, groups, max_splits=MAX_GROUP_FOLDS), in_sample_and_adjusted(X_base_all, y)[1]
    base_r2 = in_sample_and_adjusted(X_base_all, y)[0]

    # Build a single merged table with pool candidates aligned
    pool_only_cols = [c for c in pool_num_cols if c not in base_used]
    merged_pool = pool_grp[["__day0__","__ticker__"] + pool_only_cols]
    merged = merged_base.merge(merged_pool, on=["__day0__","__ticker__"], how="left")

    # ---- ONE-AT-A-TIME RANKING ----
    rows = []
    for c in pool_only_cols:
        if c not in merged.columns: 
            continue
        Xt = pd.concat([X_base_all, merged[[c]]], axis=1)
        data = pd.concat([y, Xt], axis=1).dropna()
        y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
        if X_c.shape[1] == 0 or len(y_c) < 10: 
            continue
        cv = safe_group_cv_r2(X_c, y_c, groups.loc[X_c.index], max_splits=MAX_GROUP_FOLDS)
        rows.append({"window": window, "feature": c,
                     "cv_with_feature": cv, "delta": cv - base_cv,
                     "rows_used": len(y_c)})
    single_df = pd.DataFrame(rows).sort_values("delta", ascending=False).reset_index(drop=True)
    all_single.append(single_df)

    # ---- GREEDY FORWARD ADD (start from base, add v3 features while cross-validated coefficient of determination rises) ----
    added = []
    current_cols = base_used.copy()
    current_cv = base_cv
    path_rows = []
    for step in range(MAX_ADDS):
        best_gain, best_feat, best_cv = 0.0, None, None
        for c in pool_only_cols:
            if c in added: 
                continue
            Xt = pd.concat([merged[current_cols], merged[[c]]], axis=1)
            data = pd.concat([y, Xt], axis=1).dropna()
            y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
            if X_c.shape[1] == 0 or len(y_c) < 10: 
                continue
            cv = safe_group_cv_r2(X_c, y_c, groups.loc[X_c.index], max_splits=MAX_GROUP_FOLDS)
            gain = cv - current_cv
            if gain > best_gain:
                best_gain, best_feat, best_cv = gain, c, cv
        if best_feat is None or best_gain < MIN_GAIN:
            break
        added.append(best_feat)
        current_cols.append(best_feat)
        current_cv = best_cv
        r2_now, adj_now = in_sample_and_adjusted(merged[current_cols].dropna(), y.loc[merged[current_cols].dropna().index])
        path_rows.append({"window": window, "step": len(added), "added_feature": best_feat,
                          "cv_r_squared": current_cv, "gain": best_gain,
                          "r_squared_in_sample": r2_now, "adjusted_r_squared_in_sample": adj_now})

    path_df = pd.DataFrame(path_rows)
    all_paths.append(path_df)

    all_summary.append({
        "window": window,
        "base_used_features": ", ".join(base_used),
        "baseline_cross_validated_r_squared": base_cv,
        "baseline_r_squared": base_r2,
        "baseline_adjusted_r_squared": base_adj,
        "n_added_from_v3": len(added),
        "added_features": ", ".join(added),
        "final_cross_validated_r_squared": current_cv,
        "improvement_vs_baseline": current_cv - base_cv
    })

# -------- REPORT + SAVE --------
single_all = pd.concat(all_single, ignore_index=True) if all_single else pd.DataFrame()
path_all   = pd.concat(all_paths,  ignore_index=True) if all_paths  else pd.DataFrame()
summary    = pd.DataFrame(all_summary)

pd.set_option("display.max_rows", 200)
pd.set_option("display.max_columns", None)

print("\n=== Summary (per window) ===")
display(summary.sort_values("window"))

if not single_all.empty:
    print("\n=== Top single add-on features from v3 (by delta cross-validated coefficient of determination) ===")
    display(single_all.groupby("window").head(25))

if not path_all.empty:
    print("\n=== Greedy forward add path (what we would add in order) ===")
    display(path_all)

out_dir = find_file(EVENT_FILE).parent
summary.to_csv(out_dir / "v12_addfromv3_summary.csv", index=False)
if not single_all.empty: single_all.to_csv(out_dir / "v12_addfromv3_top_single_gains.csv", index=False)
if not path_all.empty:   path_all.to_csv(out_dir / "v12_addfromv3_greedy_path.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v12_addfromv3_summary.csv")
print(" - v12_addfromv3_top_single_gains.csv")
print(" - v12_addfromv3_greedy_path.csv")
=== Summary (per window) ===
window base_used_features baseline_cross_validated_r_squared baseline_r_squared baseline_adjusted_r_squared n_added_from_v3 added_features final_cross_validated_r_squared improvement_vs_baseline
0 0,1 eps_surprise_pct, pre_ret_3d, pre_ret_5d, pre_... 0.068034 0.245160 0.194838 1 macro_cpi_yoy 0.081215 0.013181
1 0,3 eps_surprise_pct, pre_ret_3d, pre_ret_5d, pre_... 0.094267 0.201481 0.148246 0 0.094267 0.000000
2 0,5 eps_surprise_pct, pre_ret_3d, pre_ret_5d, pre_... 0.121771 0.214735 0.162384 0 0.121771 0.000000
=== Top single add-on features from v3 (by delta cross-validated coefficient of determination) ===
window feature cv_with_feature delta rows_used
0 0,1 macro_cpi_yoy 0.081215 1.318111e-02 129
1 0,1 pre_vol_10d 0.072635 4.601765e-03 129
2 0,1 rates_x_surprise 0.069896 1.862384e-03 129
3 0,1 is_amc 0.068034 -8.881784e-16 129
4 0,1 is_bmo 0.068034 -8.881784e-16 129
5 0,1 is_friday 0.068034 -8.881784e-16 129
6 0,1 pre_ret_10d 0.065582 -2.451491e-03 129
7 0,1 cpi_x_prevol5d 0.063162 -4.871555e-03 129
8 0,1 is_monday 0.060641 -7.392425e-03 129
9 0,1 vix_chg_10d_lag1 0.060394 -7.639289e-03 129
10 0,1 high_vix_regime 0.057423 -1.061060e-02 129
11 0,1 cpi_x_surprise 0.054169 -1.386500e-02 129
12 0,1 is_january 0.051872 -1.616123e-02 129
13 0,1 investment_grade_option_adjusted_spread_bp 0.050660 -1.737398e-02 129
14 0,1 investment_grade_option_adjusted_spread_pct 0.050660 -1.737398e-02 129
15 0,1 macro_fedfunds 0.050613 -1.742084e-02 129
16 0,1 high_rates_regime 0.049974 -1.805981e-02 129
17 0,1 pre_vol_3d 0.048157 -1.987686e-02 129
18 0,1 weekly_density 0.048107 -1.992688e-02 129
19 0,1 high_density_week 0.048107 -1.992688e-02 129
20 0,1 month 0.046575 -2.145911e-02 129
21 0,1 baa_minus_aaa_bp 0.044385 -2.364825e-02 129
22 0,1 baa_minus_aaa_pct 0.044385 -2.364825e-02 129
23 0,1 mkt_ret_1d_lag1 0.041337 -2.669715e-02 129
24 0,1 quarter 0.040292 -2.774160e-02 129
36 0,3 macro_cpi_yoy 0.096439 2.172138e-03 129
37 0,3 is_amc 0.094267 6.383782e-16 129
38 0,3 is_friday 0.094267 6.383782e-16 129
39 0,3 is_bmo 0.094267 6.383782e-16 129
40 0,3 pre_vol_10d 0.093876 -3.911947e-04 129
41 0,3 is_monday 0.093087 -1.179961e-03 129
42 0,3 macro_fedfunds 0.090032 -4.235049e-03 129
43 0,3 rates_x_surprise 0.089324 -4.943540e-03 129
44 0,3 cpi_x_prevol5d 0.088516 -5.751032e-03 129
45 0,3 investment_grade_option_adjusted_spread_bp 0.088365 -5.902608e-03 129
46 0,3 investment_grade_option_adjusted_spread_pct 0.088365 -5.902608e-03 129
47 0,3 cpi_x_surprise 0.080796 -1.347118e-02 129
48 0,3 pre_vol_3d 0.080411 -1.385620e-02 129
49 0,3 high_rates_regime 0.079143 -1.512391e-02 129
50 0,3 is_january 0.078816 -1.545143e-02 129
51 0,3 high_vix_regime 0.077928 -1.633957e-02 129
52 0,3 pre_ret_10d 0.077158 -1.710937e-02 129
53 0,3 baa_minus_aaa_pct 0.076349 -1.791780e-02 129
54 0,3 baa_minus_aaa_bp 0.076349 -1.791780e-02 129
55 0,3 weekly_density 0.074380 -1.988693e-02 129
56 0,3 high_density_week 0.074380 -1.988693e-02 129
57 0,3 month 0.068821 -2.544574e-02 129
58 0,3 mkt_ret_1d_lag1 0.067033 -2.723425e-02 129
59 0,3 quarter 0.059743 -3.452387e-02 129
60 0,3 day_of_week 0.054746 -3.952102e-02 129
72 0,5 is_amc 0.121771 1.221245e-15 129
73 0,5 is_bmo 0.121771 1.221245e-15 129
74 0,5 is_friday 0.121771 1.221245e-15 129
75 0,5 pre_vol_10d 0.120651 -1.120119e-03 129
76 0,5 macro_fedfunds 0.119332 -2.438700e-03 129
77 0,5 macro_cpi_yoy 0.117592 -4.178903e-03 129
78 0,5 high_density_week 0.116452 -5.318717e-03 129
79 0,5 weekly_density 0.116452 -5.318717e-03 129
80 0,5 cpi_x_prevol5d 0.115113 -6.657469e-03 129
81 0,5 is_monday 0.114571 -7.199965e-03 129
82 0,5 investment_grade_option_adjusted_spread_bp 0.113210 -8.560419e-03 129
83 0,5 investment_grade_option_adjusted_spread_pct 0.113210 -8.560419e-03 129
84 0,5 high_vix_regime 0.111899 -9.871706e-03 129
85 0,5 high_rates_regime 0.111492 -1.027892e-02 129
86 0,5 baa_minus_aaa_bp 0.110206 -1.156436e-02 129
87 0,5 baa_minus_aaa_pct 0.110206 -1.156436e-02 129
88 0,5 is_january 0.108703 -1.306799e-02 129
89 0,5 pre_ret_10d 0.106843 -1.492733e-02 129
90 0,5 pre_vol_3d 0.105928 -1.584277e-02 129
91 0,5 month 0.101969 -1.980198e-02 129
92 0,5 rates_x_surprise 0.097229 -2.454192e-02 129
93 0,5 mkt_ret_1d_lag1 0.097109 -2.466205e-02 129
94 0,5 cpi_x_surprise 0.095834 -2.593715e-02 129
95 0,5 quarter 0.092512 -2.925895e-02 129
96 0,5 vix_chg_10d_lag1 0.091427 -3.034321e-02 129
=== Greedy forward add path (what we would add in order) ===
window step added_feature cv_r_squared gain r_squared_in_sample adjusted_r_squared_in_sample
0 0,1 1 macro_cpi_yoy 0.081215 0.013181 0.262488 0.20671
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v12_addfromv3_summary.csv
 - v12_addfromv3_top_single_gains.csv
 - v12_addfromv3_greedy_path.csv
In [35]:
# === Compare features v1.2 vs v1.3 vs v1.4 (join on day0 + ticker) ===
# Metrics shown per window: R^2, Adjusted R^2, Cross-validated R^2 (grouped by ticker)
# If needed first:  pip install pandas numpy scikit-learn openpyxl

from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold

# ---------- CONFIG ----------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
    Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["features v1.2.xlsx", "features v1.3.xlsx", "features v1.4.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5

# ---------- HELPERS ----------
def find_file(name: str):
    for b in BASE_DIRS:
        p = b / name
        if p.exists(): return p
    raise FileNotFoundError(f"Could not find {name}")

def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands: return next(iter(book))
    # pick sheet with the most numeric columns (then most rows)
    def score(item):
        n, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_event_window_sheets(book: dict):
    out = {"0,1": None, "0,3": None, "0,5": None}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
    }
    for nm in book:
        if is_readme_sheet(nm): continue
        for w, pat in pats.items():
            if out[w] is None and pat.search(str(nm)): out[w] = nm
    return out

def find_day0_column(df: pd.DataFrame):
    strict = [c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), flags=re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    # fallback: most date-like
    best, kbest = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > kbest: best, kbest = c, k
    return best

def find_ticker_column(df: pd.DataFrame):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    # fallback: best-looking object column
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score: best, score = c, sc
    return best

def normalize_day0(s: pd.Series):
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series):
    return s.astype(str).str.strip().str.upper()

def find_target_col(df: pd.DataFrame):
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
    return c2[0] if c2 else None

def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
    df = df_feat_raw.copy()
    df["__day0__"]   = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
    g = g.dropna(subset=["__day0__","__ticker__"])
    return g, num_cols

def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
    """Mean test coefficient of determination with GroupKFold by ticker; KFold fallback if too few tickers."""
    n_groups = int(pd.Series(groups).nunique())
    mdl = LinearRegression()
    scores = []
    if n_groups >= 2:
        gkf = GroupKFold(n_splits=min(max_folds, n_groups))
        splits = gkf.split(X, y, groups=groups)
    else:
        n = len(X)
        if n < 3: return np.nan
        splits = KFold(n_splits=min(3, n), shuffle=True, random_state=42).split(X, y)
    for tr, te in splits:
        mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
        y_hat = mdl.predict(X.iloc[te].values)
        y_true = y.iloc[te].values
        ss_res = np.sum((y_true - y_hat)**2)
        ss_tot = np.sum((y_true - np.mean(y_true))**2)
        scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
    return float(np.nanmean(scores))

def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
    data = pd.concat([y, X], axis=1).dropna()
    y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
    n, p = len(y_c), X_c.shape[1]
    if p == 0 or n < max(10, p+2):
        return dict(rows_used=int(n), features_used=int(p),
                    r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
    mdl = LinearRegression().fit(X_c.values, y_c.values)
    r2 = float(mdl.score(X_c.values, y_c.values))
    adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
    cv = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
    return dict(rows_used=int(n), features_used=int(p),
                r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)

# ---------- LOAD ----------
evt_path = find_file(EVENT_FILE)
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map  = find_event_window_sheets(evt_book)

present = [f for f in FEATURE_FILES if (BASE_DIRS[0]/f).exists() or (Path(".")/f).exists() or (Path("/mnt/data")/f).exists()]
assert present, "None of the features files were found."

merge_audit = []
results = []

for fname in present:
    fpath = find_file(fname)
    feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
    fsheet = choose_features_sheet(feat_book)
    df_feat_raw = feat_book[fsheet].copy()

    dfeat = find_day0_column(df_feat_raw)
    tfeat = find_ticker_column(df_feat_raw)
    feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)

    for w in WINDOWS:
        esheet = win_map.get(w)
        if esheet is None:
            print(f"Missing event sheet for window {w}. Skipping.")
            continue

        df_evt = evt_book[esheet].copy()
        devt = find_day0_column(df_evt)
        tevt = find_ticker_column(df_evt)
        ycol = find_target_col(df_evt)

        evt = df_evt.copy()
        evt["__day0__"]   = normalize_day0(evt[devt])
        evt["__ticker__"] = normalize_ticker(evt[tevt])
        evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])

        merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
        groups = merged["__ticker__"]
        X = build_X(merged, num_cols, ycol)
        y = merged[ycol].astype(float)

        merge_audit.append({
            "features_file": fname, "features_sheet": fsheet, "event_sheet": esheet, "window": w,
            "day0_features_col": dfeat, "ticker_features_col": tfeat,
            "day0_event_col": devt, "ticker_event_col": tevt,
            "merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
        })

        m = fit_and_score(X, y, groups)
        m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
        results.append(m)

# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)

print("\nMerge audit:")
display(pd.DataFrame(merge_audit))

res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1.2 vs v1.3 vs v1.4):")
display(res_df)

print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
                          columns="features_file",
                          values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
                          aggfunc="first")
display(wide)

# ---------- DELTAS vs baseline v1.2 ----------
pairs = []
for w in WINDOWS:
    base = res_df[(res_df["features_file"]=="features v1.2.xlsx") & (res_df["window"]==w)]
    for alt in ["features v1.3.xlsx", "features v1.4.xlsx"]:
        comp = res_df[(res_df["features_file"]==alt) & (res_df["window"]==w)]
        if not base.empty and not comp.empty:
            pairs.append({
                "window": w,
                "model_vs_v1.2": alt,
                "delta_cross_validated_r_squared": float(comp["cross_validated_r_squared"].iloc[0] - base["cross_validated_r_squared"].iloc[0]),
                "delta_adjusted_r_squared": float(comp["adjusted_r_squared"].iloc[0] - base["adjusted_r_squared"].iloc[0]),
                "delta_r_squared": float(comp["r_squared"].iloc[0] - base["r_squared"].iloc[0]),
                "rows_used_base": int(base["rows_used"].iloc[0]),
                "rows_used_alt": int(comp["rows_used"].iloc[0]),
                "features_used_base": int(base["features_used"].iloc[0]),
                "features_used_alt": int(comp["features_used"].iloc[0]),
            })
if pairs:
    deltas = pd.DataFrame(pairs).sort_values(["window","model_vs_v1.2"]).reset_index(drop=True)
    print("\nDeltas vs v1.2 — positive is good:")
    display(deltas)

# ---------- SAVE ----------
out_dir = find_file(EVENT_FILE).parent
res_df.to_csv(out_dir / "v1.2_v1.3_v1.4_results.csv", index=False)
wide.to_csv(out_dir / "v1.2_v1.3_v1.4_comparison_table.csv")
if pairs:
    deltas.to_csv(out_dir / "v1.2_v1.3_v1.4_deltas_vs_v12.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v1.2_v1.3_v1.4_results.csv")
print(" - v1.2_v1.3_v1.4_comparison_table.csv")
print(" - v1.2_v1.3_v1.4_deltas_vs_v12.csv")
Merge audit:
features_file features_sheet event_sheet window day0_features_col ticker_features_col day0_event_col ticker_event_col merged_rows predictor_cols target_col
0 features v1.2.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 8 CAR
1 features v1.2.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 8 CAR
2 features v1.2.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 8 CAR
3 features v1.3.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 9 CAR
4 features v1.3.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 9 CAR
5 features v1.3.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 9 CAR
6 features v1.4.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 9 CAR
7 features v1.4.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 9 CAR
8 features v1.4.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 9 CAR
Results (v1.2 vs v1.3 vs v1.4):
rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared features_file features_sheet window
0 129 8 0.245160 0.194838 0.068034 features v1.2.xlsx features 0,1
1 129 9 0.251629 0.195030 0.037240 features v1.3.xlsx features 0,1
2 129 9 0.257667 0.201524 0.045910 features v1.4.xlsx features 0,1
3 129 8 0.201481 0.148246 0.094267 features v1.2.xlsx features 0,3
4 129 9 0.212109 0.152520 0.018953 features v1.3.xlsx features 0,3
5 129 9 0.208601 0.148747 0.054705 features v1.4.xlsx features 0,3
6 129 8 0.214735 0.162384 0.121771 features v1.2.xlsx features 0,5
7 129 9 0.221031 0.162118 0.031952 features v1.3.xlsx features 0,5
8 129 9 0.215733 0.156419 0.082984 features v1.4.xlsx features 0,5
Comparison table (rows = windows | columns = metrics per file):
adjusted_r_squared cross_validated_r_squared r_squared
features_file features v1.2.xlsx features v1.3.xlsx features v1.4.xlsx features v1.2.xlsx features v1.3.xlsx features v1.4.xlsx features v1.2.xlsx features v1.3.xlsx features v1.4.xlsx
window
0,1 0.194838 0.195030 0.201524 0.068034 0.037240 0.045910 0.245160 0.251629 0.257667
0,3 0.148246 0.152520 0.148747 0.094267 0.018953 0.054705 0.201481 0.212109 0.208601
0,5 0.162384 0.162118 0.156419 0.121771 0.031952 0.082984 0.214735 0.221031 0.215733
Deltas vs v1.2 — positive is good:
window model_vs_v1.2 delta_cross_validated_r_squared delta_adjusted_r_squared delta_r_squared rows_used_base rows_used_alt features_used_base features_used_alt
0 0,1 features v1.3.xlsx -0.030793 0.000192 0.006469 129 129 8 9
1 0,1 features v1.4.xlsx -0.022123 0.006686 0.012506 129 129 8 9
2 0,3 features v1.3.xlsx -0.075314 0.004274 0.010628 129 129 8 9
3 0,3 features v1.4.xlsx -0.039562 0.000500 0.007120 129 129 8 9
4 0,5 features v1.3.xlsx -0.089819 -0.000266 0.006297 129 129 8 9
5 0,5 features v1.4.xlsx -0.038787 -0.005965 0.000998 129 129 8 9
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v1.2_v1.3_v1.4_results.csv
 - v1.2_v1.3_v1.4_comparison_table.csv
 - v1.2_v1.3_v1.4_deltas_vs_v12.csv
In [41]:
# === Audit: does "macro CPI YoY" from v3 lift cross validated coefficient of determination? ===
# Compares per window: v1.2 baseline, v1.2 + macro from v3, v1.3 file.
# Also checks whether v1.3's macro column equals the v3 macro values after join on day0 + ticker.

from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold

# ----------------- CONFIG -----------------
SEARCH_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
    Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
BASE_FILE  = "features v1.2.xlsx"
V13_FILE   = "features v1.3.xlsx"
V3_FILE    = "features v3.xlsx"
WINDOWS    = ["0,1", "0,3", "0,5"]
MAX_FOLDS  = 5

# ----------------- HELPERS -----------------
def find_file(name: str) -> Path:
    for d in SEARCH_DIRS:
        p = d / name
        if p.exists():
            return p
    raise FileNotFoundError(f"Could not find: {name}")

def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    # pick non-readme sheet with most numeric columns, then most rows
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands:
        return next(iter(book))
    def score(item):
        n, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def window_sheets(book: dict):
    out = {"0,1": None, "0,3": None, "0,5": None}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
    }
    for nm in book:
        if is_readme_sheet(nm): 
            continue
        for w, pat in pats.items():
            if out[w] is None and pat.search(str(nm)):
                out[w] = nm
    return out

def find_day0(df: pd.DataFrame):
    strict = [c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), flags=re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    # fallback: most date-like column
    best, kbest = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > kbest:
            best, kbest = c, k
    return best

def find_ticker(df: pd.DataFrame):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    # fallback: best object column by uniqueness
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score:
            best, score = c, sc
    return best

def find_target(df: pd.DataFrame):
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
    return c2[0] if c2 else None

def norm_day0(s: pd.Series):
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def norm_ticker(s: pd.Series):
    return s.astype(str).str.strip().str.upper()

def group_numeric(df: pd.DataFrame, day0_col: str, tic_col: str):
    g = df.copy()
    g["__day0__"] = norm_day0(g[day0_col])
    g["__tic__"]  = norm_ticker(g[tic_col])
    num = g.select_dtypes(include=[np.number]).columns.tolist()
    g = (g.groupby(["__day0__","__tic__"], as_index=False)[num].mean()
           .dropna(subset=["__day0__","__tic__"]))
    return g, num

def build_X(merged: pd.DataFrame, cols: list, ycol: str):
    keep = [c for c in cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[ycol], errors="ignore")
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
    n_groups = int(pd.Series(groups).nunique())
    mdl = LinearRegression()
    scores = []
    if n_groups >= 2:
        splitter = GroupKFold(n_splits=min(max_folds, n_groups))
        splits = splitter.split(X, y, groups=groups)
    else:
        n = len(X)
        if n < 3: return np.nan
        splitter = KFold(n_splits=min(3, n), shuffle=True, random_state=42)
        splits = splitter.split(X, y)
    for tr, te in splits:
        mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
        pred = mdl.predict(X.iloc[te].values)
        true = y.iloc[te].values
        ss_res = np.sum((true - pred)**2)
        ss_tot = np.sum((true - np.mean(true))**2)
        scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
    return float(np.nanmean(scores))

def r2_and_adjusted(X: pd.DataFrame, y: pd.Series):
    mdl = LinearRegression().fit(X.values, y.values)
    r2 = float(mdl.score(X.values, y.values))
    n, p = len(y), X.shape[1]
    adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
    return r2, adj

def find_macro_cpi_yoy(cols: list) -> str | None:
    # Flexible match for names like "macro_cpi_yoy", "Macro CPI YoY", etc.
    for c in cols:
        s = re.sub(r"[^a-z0-9]+", "", str(c).lower())
        if "macro" in s and "cpi" in s and ("yoy" in s or "yearover" in s or "yoy" in s):
            return c
    # second pass: contains "cpi" and "yoy"
    for c in cols:
        s = str(c).lower()
        if "cpi" in s and "yoy" in s:
            return c
    return None

# ----------------- LOAD FILES -----------------
evt_book = pd.read_excel(find_file(EVENT_FILE), sheet_name=None, engine="openpyxl")
win_map  = window_sheets(evt_book)

b_book = pd.read_excel(find_file(BASE_FILE), sheet_name=None, engine="openpyxl")
b_sheet = choose_features_sheet(b_book); b_raw = b_book[b_sheet].copy()
b_day0 = find_day0(b_raw); b_tic = find_ticker(b_raw)
b_grp, b_cols = group_numeric(b_raw, b_day0, b_tic)

v3_book = pd.read_excel(find_file(V3_FILE), sheet_name=None, engine="openpyxl")
v3_sheet = choose_features_sheet(v3_book); v3_raw = v3_book[v3_sheet].copy()
v3_day0 = find_day0(v3_raw); v3_tic = find_ticker(v3_raw)
v3_grp, v3_cols = group_numeric(v3_raw, v3_day0, v3_tic)

v13_book = pd.read_excel(find_file(V13_FILE), sheet_name=None, engine="openpyxl")
v13_sheet = choose_features_sheet(v13_book); v13_raw = v13_book[v13_sheet].copy()
v13_day0 = find_day0(v13_raw); v13_tic = find_ticker(v13_raw)
v13_grp, v13_cols = group_numeric(v13_raw, v13_day0, v13_tic)

macro_v3_col = find_macro_cpi_yoy(v3_cols)
macro_v13_col = find_macro_cpi_yoy(v13_cols)

if macro_v3_col is None:
    raise ValueError("Could not find a 'macro CPI YoY' column in v3.")

# ----------------- RUN PER WINDOW -----------------
rows = []
checks = []

for w in WINDOWS:
    es = win_map.get(w)
    if es is None:
        print(f"Skip {w}: no event sheet.")
        continue

    ev = evt_book[es].copy()
    e_day0 = find_day0(ev); e_tic = find_ticker(ev); ycol = find_target(ev)

    ev["__day0__"] = norm_day0(ev[e_day0])
    ev["__tic__"]  = norm_ticker(ev[e_tic])
    ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])

    # --- v1.2 baseline ---
    mb = b_grp.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner")
    X_base = build_X(mb, b_cols, ycol)
    y = mb[ycol].astype(float)
    groups = mb["__tic__"]
    base_cv = cv_r2(X_base, y, groups, MAX_FOLDS)
    base_r2, base_adj = r2_and_adjusted(X_base, y)

    # --- v1.2 + macro column from v3 ---
    macro_from_v3 = v3_grp[["__day0__","__tic__", macro_v3_col]].rename(columns={macro_v3_col:"macro_v3"})
    mbm = mb.merge(macro_from_v3, on=["__day0__","__tic__"], how="left")
    X_plus = pd.concat([X_base, mbm[["macro_v3"]]], axis=1)
    data_plus = pd.concat([y, X_plus], axis=1).dropna()
    y_plus, X_plus_c = data_plus.iloc[:,0], data_plus.iloc[:,1:]
    plus_cv = cv_r2(X_plus_c, y_plus, groups.loc[X_plus_c.index], MAX_FOLDS)

    # --- v1.3 file as-is ---
    mv13 = v13_grp.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner")
    X_13 = build_X(mv13, v13_cols, ycol)
    y_13 = mv13[ycol].astype(float)
    g_13 = mv13["__tic__"]
    v13_cv = cv_r2(X_13, y_13, g_13, MAX_FOLDS)

    rows.append({
        "window": w,
        "rows_used": int(len(X_base)),
        "features_used_v12": int(X_base.shape[1]),
        "base_cross_validated_r_squared": float(base_cv),
        "base_r_squared": float(base_r2),
        "base_adjusted_r_squared": float(base_adj),
        "v12_plus_macro_cross_validated_r_squared": float(plus_cv),
        "delta_plus_vs_base": float(plus_cv - base_cv),
        "v13_cross_validated_r_squared": float(v13_cv),
        "features_used_v13": int(X_13.shape[1]),
    })

    # --- macro value equality check (v3 vs v1.3 on the same rows) ---
    if macro_v13_col and macro_v13_col in mv13.columns:
        macro_from_v13 = mv13[["__day0__","__tic__", macro_v13_col]].rename(columns={macro_v13_col:"macro_v13"})
        join_macro = (
            macro_from_v3
            .merge(macro_from_v13, on=["__day0__","__tic__"], how="inner")
            .dropna(subset=["macro_v3","macro_v13"])
        )
        if len(join_macro) > 0:
            share_equal = (join_macro["macro_v3"].round(10) == join_macro["macro_v13"].round(10)).mean()
            mean_abs_diff = (join_macro["macro_v3"] - join_macro["macro_v13"]).abs().mean()
            corr = np.corrcoef(join_macro["macro_v3"], join_macro["macro_v13"])[0,1] if len(join_macro) > 2 else np.nan
        else:
            share_equal, mean_abs_diff, corr = np.nan, np.nan, np.nan
        checks.append({
            "window": w,
            "macro_rows_overlap": int(len(join_macro)),
            "share_exact_equal": float(share_equal) if pd.notna(share_equal) else np.nan,
            "mean_abs_diff": float(mean_abs_diff) if pd.notna(mean_abs_diff) else np.nan,
            "corr_v3_vs_v13": float(corr) if pd.notna(corr) else np.nan
        })
    else:
        checks.append({
            "window": w,
            "macro_rows_overlap": 0,
            "share_exact_equal": np.nan,
            "mean_abs_diff": np.nan,
            "corr_v3_vs_v13": np.nan
        })

# ----------------- SHOW + SAVE -----------------
scores = pd.DataFrame(rows).sort_values("window").reset_index(drop=True)
macro_check = pd.DataFrame(checks).sort_values("window").reset_index(drop=True)

pd.set_option("display.max_columns", None)
print("\n=== Scores per window ===")
display(scores)

print("\n=== Macro CPI YoY equality check (v3 vs v1.3) ===")
display(macro_check)

# Save next to the event file
out_dir = find_file(EVENT_FILE).parent
scores.to_csv(out_dir / "audit_v12_vs_v12plusmacro_vs_v13_scores.csv", index=False)
macro_check.to_csv(out_dir / "audit_macro_value_check.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - audit_v12_vs_v12plusmacro_vs_v13_scores.csv")
print(" - audit_macro_value_check.csv")
=== Scores per window ===
window rows_used features_used_v12 base_cross_validated_r_squared base_r_squared base_adjusted_r_squared v12_plus_macro_cross_validated_r_squared delta_plus_vs_base v13_cross_validated_r_squared features_used_v13
0 0,1 129 8 0.068034 0.245160 0.194838 0.081215 0.013181 0.037240 9
1 0,3 129 8 0.094267 0.201481 0.148246 0.096439 0.002172 0.018953 9
2 0,5 129 8 0.121771 0.214735 0.162384 0.117592 -0.004179 0.031952 9
=== Macro CPI YoY equality check (v3 vs v1.3) ===
window macro_rows_overlap share_exact_equal mean_abs_diff corr_v3_vs_v13
0 0,1 129 0.023256 2.485246 -0.217281
1 0,3 129 0.023256 2.485246 -0.217281
2 0,5 129 0.023256 2.485246 -0.217281
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - audit_v12_vs_v12plusmacro_vs_v13_scores.csv
 - audit_macro_value_check.csv
In [45]:
# === Evaluate features v1.2 only (windows 0,1 / 0,3 / 0,5) ===
# Metrics: R^2, Adjusted R^2, Cross-validated R^2 (grouped by ticker; safe fallback)
# If needed first:  pip install pandas numpy scikit-learn openpyxl

from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold

# ---------- CONFIG ----------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
    Path("."), Path("/mnt/data")
]
EVENT_FILE  = "event_study.xlsx"
FEATURES_12 = "features v1.2.xlsx"
WINDOWS     = ["0,1", "0,3", "0,5"]
MAX_GROUP_FOLDS = 5

# ---------- HELPERS ----------
def find_file(name: str):
    for b in BASE_DIRS:
        p = b / name
        if p.exists(): return p
    raise FileNotFoundError(f"Could not find: {name}")

def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands: return next(iter(book))
    def score(item):
        n, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_event_window_sheets(book: dict):
    out = {"0,1": None, "0,3": None, "0,5": None}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
    }
    for nm in book:
        if is_readme_sheet(nm): continue
        for w, pat in pats.items():
            if out[w] is None and pat.search(str(nm)): out[w] = nm
    return out

def find_day0_column(df: pd.DataFrame):
    strict = [c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), flags=re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    # most date-like
    best, kbest = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > kbest: best, kbest = c, k
    return best

def find_ticker_column(df: pd.DataFrame):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score: best, score = c, sc
    return best

def normalize_day0(s: pd.Series):
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series):
    return s.astype(str).str.strip().str.upper()

def find_target_col(df: pd.DataFrame):
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
    return c2[0] if c2 else None

def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
    df = df_feat_raw.copy()
    df["__day0__"]   = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
    g = g.dropna(subset=["__day0__","__ticker__"])
    return g, num_cols

def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
    n_groups = int(pd.Series(groups).nunique())
    mdl = LinearRegression()
    scores = []
    if n_groups >= 2:
        gkf = GroupKFold(n_splits=min(max_folds, n_groups))
        splits = gkf.split(X, y, groups=groups)
    else:
        n = len(X)
        if n < 3: return np.nan
        splits = KFold(n_splits=min(3, n), shuffle=True, random_state=42).split(X, y)
    for tr, te in splits:
        mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
        y_hat = mdl.predict(X.iloc[te].values)
        y_true = y.iloc[te].values
        ss_res = np.sum((y_true - y_hat)**2)
        ss_tot = np.sum((y_true - np.mean(y_true))**2)
        scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
    return float(np.nanmean(scores))

def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
    data = pd.concat([y, X], axis=1).dropna()
    y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
    n, p = len(y_c), X_c.shape[1]
    if p == 0 or n < max(10, p+2):
        return dict(rows_used=int(n), features_used=int(p),
                    r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
    mdl = LinearRegression().fit(X_c.values, y_c.values)
    r2 = float(mdl.score(X_c.values, y_c.values))
    adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
    cv  = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
    return dict(rows_used=int(n), features_used=int(p),
                r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)

# ---------- LOAD ----------
evt_path = find_file(EVENT_FILE)
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map  = find_event_window_sheets(evt_book)

f12_path = find_file(FEATURES_12)
f12_book = pd.read_excel(f12_path, sheet_name=None, engine="openpyxl")
f12_sheet = choose_features_sheet(f12_book)
f12_raw   = f12_book[f12_sheet].copy()
f12_day0  = find_day0_column(f12_raw)
f12_tic   = find_ticker_column(f12_raw)
f12_grp, f12_num_cols = aggregate_features(f12_raw, f12_day0, f12_tic)

# ---------- RUN ----------
rows = []
merge_audit = []

for w in WINDOWS:
    esheet = win_map.get(w)
    if esheet is None:
        print(f"Skip window {w}: event sheet not found.")
        continue

    ev = evt_book[esheet].copy()
    e_day0 = find_day0_column(ev); e_tic = find_ticker_column(ev); ycol = find_target_col(ev)
    ev["__day0__"]   = normalize_day0(ev[e_day0])
    ev["__ticker__"] = normalize_ticker(ev[e_tic])
    ev = ev.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])

    merged = f12_grp.merge(ev[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
    groups = merged["__ticker__"]
    X = build_X(merged, f12_num_cols, ycol)
    y = merged[ycol].astype(float)

    merge_audit.append({
        "window": w,
        "features_sheet": f12_sheet,
        "event_sheet": esheet,
        "day0_features_col": f12_day0,
        "ticker_features_col": f12_tic,
        "day0_event_col": e_day0,
        "ticker_event_col": e_tic,
        "merged_rows": len(merged),
        "predictor_cols_after_clean": X.shape[1],
        "target_col": ycol,
        "features_used": ", ".join(list(X.columns))
    })

    m = fit_and_score(X, y, groups)
    m.update(dict(window=w))
    rows.append(m)

# ---------- SHOW ----------
audit_df = pd.DataFrame(merge_audit)
res_df   = pd.DataFrame(rows).sort_values("window").reset_index(drop=True)

pd.set_option("display.max_columns", None)
print("\nMerge audit for v1.2:")
display(audit_df)

print("\nResults for v1.2 only:")
display(res_df)

# ---------- SAVE ----------
out_dir = evt_path.parent
res_df.to_csv(out_dir / "v12_results_only.csv", index=False)
audit_df.to_csv(out_dir / "v12_merge_audit.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v12_results_only.csv")
print(" - v12_merge_audit.csv")
Merge audit for v1.2:
window features_sheet event_sheet day0_features_col ticker_features_col day0_event_col ticker_event_col merged_rows predictor_cols_after_clean target_col features_used
0 0,1 features CAR_(0,1) day0 ticker day0 ticker 129 8 CAR eps_surprise_pct, pre_ret_3d, pre_ret_5d, pre_...
1 0,3 features CAR_(0,3) day0 ticker day0 ticker 129 8 CAR eps_surprise_pct, pre_ret_3d, pre_ret_5d, pre_...
2 0,5 features CAR_(0,5) day0 ticker day0 ticker 129 8 CAR eps_surprise_pct, pre_ret_3d, pre_ret_5d, pre_...
Results for v1.2 only:
rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared window
0 129 8 0.245160 0.194838 0.068034 0,1
1 129 8 0.201481 0.148246 0.094267 0,3
2 129 8 0.214735 0.162384 0.121771 0,5
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v12_results_only.csv
 - v12_merge_audit.csv
In [48]:
# === Final check: features v1.xlsx vs features v1.2.xlsx (join on day0 + ticker) ===
# Metrics per window: R^2, Adjusted R^2, Cross-validated R^2 (grouped by ticker; safe fallback)
# Saves: results, wide comparison, and deltas vs v1.2
#
# If needed first:  pip install pandas numpy scikit-learn openpyxl

from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold

# ---------- CONFIG ----------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
    Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["features v1.xlsx", "features v1.2.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5

# ---------- HELPERS ----------
def find_file(name: str):
    for b in BASE_DIRS:
        p = b / name
        if p.exists(): return p
    raise FileNotFoundError(f"Could not find {name}")

def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    # choose non-readme sheet with most numeric columns (then most rows)
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands: return next(iter(book))
    def score(item):
        _, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_event_window_sheets(book: dict):
    out = {"0,1": None, "0,3": None, "0,5": None}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
    }
    for nm in book:
        if is_readme_sheet(nm): continue
        for w, pat in pats.items():
            if out[w] is None and pat.search(str(nm)): out[w] = nm
    return out

def find_day0_column(df: pd.DataFrame):
    strict = [c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    # fallback: most date-like
    best, kbest = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > kbest: best, kbest = c, k
    return best

def find_ticker_column(df: pd.DataFrame):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    # fallback: best object col by uniqueness
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score: best, score = c, sc
    return best

def normalize_day0(s: pd.Series):
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series):
    return s.astype(str).str.strip().str.upper()

def find_target_col(df: pd.DataFrame):
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), re.IGNORECASE)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.IGNORECASE)]
    return c2[0] if c2 else None

def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
    df = df_feat_raw.copy()
    df["__day0__"]   = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
    g = g.dropna(subset=["__day0__","__ticker__"])
    return g, num_cols

def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
    n_groups = int(pd.Series(groups).nunique())
    mdl = LinearRegression()
    scores = []
    if n_groups >= 2:
        gkf = GroupKFold(n_splits=min(max_folds, n_groups))
        splits = gkf.split(X, y, groups=groups)
    else:
        n = len(X)
        if n < 3: return np.nan
        splits = KFold(n_splits=min(3, n), shuffle=True, random_state=42).split(X, y)
    for tr, te in splits:
        mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
        y_hat = mdl.predict(X.iloc[te].values)
        yt = y.iloc[te].values
        ss_res = np.sum((yt - y_hat)**2); ss_tot = np.sum((yt - np.mean(yt))**2)
        scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
    return float(np.nanmean(scores))

def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
    data = pd.concat([y, X], axis=1).dropna()
    y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
    n, p = len(y_c), X_c.shape[1]
    if p == 0 or n < max(10, p+2):
        return dict(rows_used=int(n), features_used=int(p),
                    r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
    mdl = LinearRegression().fit(X_c.values, y_c.values)
    r2 = float(mdl.score(X_c.values, y_c.values))
    adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
    cv  = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
    return dict(rows_used=int(n), features_used=int(p),
                r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)

# ---------- LOAD ----------
evt_path = find_file(EVENT_FILE)
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map  = find_event_window_sheets(evt_book)

present = [f for f in FEATURE_FILES if any((b/f).exists() for b in BASE_DIRS)]
assert present, "Could not find features v1.xlsx or features v1.2.xlsx"

merge_audit = []
results = []

for fname in present:
    fpath = find_file(fname)
    feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
    fsheet = choose_features_sheet(feat_book)
    df_feat_raw = feat_book[fsheet].copy()

    dfeat = find_day0_column(df_feat_raw)
    tfeat = find_ticker_column(df_feat_raw)
    feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)

    for w in WINDOWS:
        esheet = win_map.get(w)
        if esheet is None:
            print(f"Missing event sheet for window {w}. Skipping.")
            continue

        df_evt = evt_book[esheet].copy()
        devt = find_day0_column(df_evt)
        tevt = find_ticker_column(df_evt)
        ycol = find_target_col(df_evt)

        evt = df_evt.copy()
        evt["__day0__"]   = normalize_day0(evt[devt])
        evt["__ticker__"] = normalize_ticker(evt[tevt])
        evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])

        merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
        groups = merged["__ticker__"]
        X = build_X(merged, num_cols, ycol)
        y = merged[ycol].astype(float)

        merge_audit.append({
            "features_file": fname, "features_sheet": fsheet, "event_sheet": esheet, "window": w,
            "day0_features_col": dfeat, "ticker_features_col": tfeat,
            "day0_event_col": devt, "ticker_event_col": tevt,
            "merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
        })

        m = fit_and_score(X, y, groups)
        m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
        results.append(m)

# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)

print("\nMerge audit:")
display(pd.DataFrame(merge_audit))

res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1 vs v1.2):")
display(res_df)

print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
                          columns="features_file",
                          values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
                          aggfunc="first")
display(wide)

# ---------- DELTAS vs baseline v1.2 ----------
pairs = []
for w in WINDOWS:
    base = res_df[(res_df["features_file"]=="features v1.2.xlsx") & (res_df["window"]==w)]
    comp = res_df[(res_df["features_file"]=="features v1.xlsx") & (res_df["window"]==w)]
    if not base.empty and not comp.empty:
        pairs.append({
            "window": w,
            "delta_cv_r_squared_v1.2_minus_v1": float(base["cross_validated_r_squared"].iloc[0] - comp["cross_validated_r_squared"].iloc[0]),
            "delta_adjusted_r_squared_v1.2_minus_v1": float(base["adjusted_r_squared"].iloc[0] - comp["adjusted_r_squared"].iloc[0]),
            "delta_r_squared_v1.2_minus_v1": float(base["r_squared"].iloc[0] - comp["r_squared"].iloc[0]),
            "rows_used_v1.2": int(base["rows_used"].iloc[0]),
            "rows_used_v1": int(comp["rows_used"].iloc[0]),
            "features_used_v1.2": int(base["features_used"].iloc[0]),
            "features_used_v1": int(comp["features_used"].iloc[0]),
        })
if pairs:
    deltas = pd.DataFrame(pairs).sort_values("window").reset_index(drop=True)
    print("\nDeltas (v1.2 minus v1) — positive means v1.2 is better:")
    display(deltas)

# ---------- SAVE ----------
out_dir = evt_path.parent
res_df.to_csv(out_dir / "v1_vs_v1.2_results.csv", index=False)
wide.to_csv(out_dir / "v1_vs_v1.2_comparison_table.csv")
if pairs:
    deltas.to_csv(out_dir / "v1_vs_v1.2_deltas.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v1_vs_v1.2_results.csv")
print(" - v1_vs_v1.2_comparison_table.csv")
print(" - v1_vs_v1.2_deltas.csv")
Merge audit:
features_file features_sheet event_sheet window day0_features_col ticker_features_col day0_event_col ticker_event_col merged_rows predictor_cols target_col
0 features v1.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 16 CAR
1 features v1.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 16 CAR
2 features v1.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 16 CAR
3 features v1.2.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 8 CAR
4 features v1.2.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 8 CAR
5 features v1.2.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 8 CAR
Results (v1 vs v1.2):
rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared features_file features_sheet window
0 129 8 0.245160 0.194838 0.068034 features v1.2.xlsx features 0,1
1 129 16 0.303485 0.203983 -0.115372 features v1.xlsx features 0,1
2 129 8 0.201481 0.148246 0.094267 features v1.2.xlsx features 0,3
3 129 16 0.250824 0.143799 -0.155072 features v1.xlsx features 0,3
4 129 8 0.214735 0.162384 0.121771 features v1.2.xlsx features 0,5
5 129 16 0.257400 0.151314 -0.089552 features v1.xlsx features 0,5
Comparison table (rows = windows | columns = metrics per file):
adjusted_r_squared cross_validated_r_squared r_squared
features_file features v1.2.xlsx features v1.xlsx features v1.2.xlsx features v1.xlsx features v1.2.xlsx features v1.xlsx
window
0,1 0.194838 0.203983 0.068034 -0.115372 0.245160 0.303485
0,3 0.148246 0.143799 0.094267 -0.155072 0.201481 0.250824
0,5 0.162384 0.151314 0.121771 -0.089552 0.214735 0.257400
Deltas (v1.2 minus v1) — positive means v1.2 is better:
window delta_cv_r_squared_v1.2_minus_v1 delta_adjusted_r_squared_v1.2_minus_v1 delta_r_squared_v1.2_minus_v1 rows_used_v1.2 rows_used_v1 features_used_v1.2 features_used_v1
0 0,1 0.183406 -0.009145 -0.058325 129 129 8 16
1 0,3 0.249339 0.004448 -0.049343 129 129 8 16
2 0,5 0.211322 0.011070 -0.042665 129 129 8 16
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v1_vs_v1.2_results.csv
 - v1_vs_v1.2_comparison_table.csv
 - v1_vs_v1.2_deltas.csv
In [1]:
# === Compare features v1.2 vs v1.3 (join on day0 + ticker) ===
# Metrics per window: R^2, Adjusted R^2, Cross-validated R^2 (grouped by ticker; safe fallback)
# Saves: results, wide comparison, and deltas vs v1.2
#
# If needed first:  pip install pandas numpy scikit-learn openpyxl

from pathlib import Path
import re
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold

# ---------- CONFIG ----------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data"),
    Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["features v1.2.xlsx", "features v1.3.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5

# ---------- HELPERS ----------
def find_file(name: str):
    for b in BASE_DIRS:
        p = b / name
        if p.exists(): return p
    raise FileNotFoundError(f"Could not find {name}")

def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands: return next(iter(book))
    def score(item):
        _, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_event_window_sheets(book: dict):
    out = {"0,1": None, "0,3": None, "0,5": None}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
    }
    for nm in book:
        if is_readme_sheet(nm): continue
        for w, pat in pats.items():
            if out[w] is None and pat.search(str(nm)): out[w] = nm
    return out

def find_day0_column(df: pd.DataFrame):
    strict = [c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","ANNOUNCEMENT_DATE","announcement_date",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    best, kbest = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > kbest: best, kbest = c, k
    return best

def find_ticker_column(df: pd.DataFrame):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score: best, score = c, sc
    return best

def normalize_day0(s: pd.Series):
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series):
    return s.astype(str).str.strip().str.upper()

def find_target_col(df: pd.DataFrame):
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), re.IGNORECASE)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.IGNORECASE)]
    return c2[0] if c2 else None

def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
    df = df_feat_raw.copy()
    df["__day0__"]   = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
    g = g.dropna(subset=["__day0__","__ticker__"])
    return g, num_cols

def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
    n_groups = int(pd.Series(groups).nunique())
    mdl = LinearRegression()
    scores = []
    if n_groups >= 2:
        gkf = GroupKFold(n_splits=min(max_folds, n_groups))
        splits = gkf.split(X, y, groups=groups)
    else:
        n = len(X)
        if n < 3: return np.nan
        splits = KFold(n_splits=min(3, n), shuffle=True, random_state=42).split(X, y)
    for tr, te in splits:
        mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
        y_hat = mdl.predict(X.iloc[te].values)
        yt = y.iloc[te].values
        ss_res = np.sum((yt - y_hat)**2); ss_tot = np.sum((yt - np.mean(yt))**2)
        scores.append(1.0 - ss_res/ss_tot if ss_tot > 0 else np.nan)
    return float(np.nanmean(scores))

def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
    data = pd.concat([y, X], axis=1).dropna()
    y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
    n, p = len(y_c), X_c.shape[1]
    if p == 0 or n < max(10, p+2):
        return dict(rows_used=int(n), features_used=int(p),
                    r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
    mdl = LinearRegression().fit(X_c.values, y_c.values)
    r2 = float(mdl.score(X_c.values, y_c.values))
    adj = 1.0 - (1.0 - r2)*(n - 1.0)/(n - p - 1.0) if (n - p - 1.0) > 0 else np.nan
    cv  = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
    return dict(rows_used=int(n), features_used=int(p),
                r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)

# ---------- LOAD ----------
evt_path = find_file(EVENT_FILE)
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map  = find_event_window_sheets(evt_book)

present = [f for f in FEATURE_FILES if any((b/f).exists() for b in BASE_DIRS)]
assert present, "Could not find features v1.2.xlsx or features v1.3.xlsx"

merge_audit = []
results = []

for fname in present:
    fpath = find_file(fname)
    feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
    fsheet = choose_features_sheet(feat_book)
    df_feat_raw = feat_book[fsheet].copy()

    dfeat = find_day0_column(df_feat_raw)
    tfeat = find_ticker_column(df_feat_raw)
    feat_g, num_cols = aggregate_features(df_feat_raw, dfeat, tfeat)

    for w in WINDOWS:
        esheet = win_map.get(w)
        if esheet is None:
            print(f"Missing event sheet for window {w}. Skipping.")
            continue

        df_evt = evt_book[esheet].copy()
        devt = find_day0_column(df_evt)
        tevt = find_ticker_column(df_evt)
        ycol = find_target_col(df_evt)

        evt = df_evt.copy()
        evt["__day0__"]   = normalize_day0(evt[devt])
        evt["__ticker__"] = normalize_ticker(evt[tevt])
        evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])

        merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
        groups = merged["__ticker__"]
        X = build_X(merged, num_cols, ycol)
        y = merged[ycol].astype(float)

        merge_audit.append({
            "features_file": fname, "features_sheet": fsheet, "event_sheet": esheet, "window": w,
            "day0_features_col": dfeat, "ticker_features_col": tfeat,
            "day0_event_col": devt, "ticker_event_col": tevt,
            "merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
        })

        m = fit_and_score(X, y, groups)
        m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
        results.append(m)

# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)

print("\nMerge audit:")
display(pd.DataFrame(merge_audit))

res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1.2 vs v1.3):")
display(res_df)

print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
                          columns="features_file",
                          values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
                          aggfunc="first")
display(wide)

# ---------- DELTAS vs baseline v1.2 (positive = v1.3 is better) ----------
pairs = []
for w in WINDOWS:
    base = res_df[(res_df["features_file"]=="features v1.2.xlsx") & (res_df["window"]==w)]
    comp = res_df[(res_df["features_file"]=="features v1.3.xlsx") & (res_df["window"]==w)]
    if not base.empty and not comp.empty:
        pairs.append({
            "window": w,
            "delta_cv_r_squared_v13_minus_v12": float(comp["cross_validated_r_squared"].iloc[0] - base["cross_validated_r_squared"].iloc[0]),
            "delta_adjusted_r_squared_v13_minus_v12": float(comp["adjusted_r_squared"].iloc[0] - base["adjusted_r_squared"].iloc[0]),
            "delta_r_squared_v13_minus_v12": float(comp["r_squared"].iloc[0] - base["r_squared"].iloc[0]),
            "rows_used_v1.2": int(base["rows_used"].iloc[0]),
            "rows_used_v1.3": int(comp["rows_used"].iloc[0]),
            "features_used_v1.2": int(base["features_used"].iloc[0]),
            "features_used_v1.3": int(comp["features_used"].iloc[0]),
        })
if pairs:
    deltas = pd.DataFrame(pairs).sort_values("window").reset_index(drop=True)
    print("\nDeltas (v1.3 minus v1.2) — positive means v1.3 is better:")
    display(deltas)

# ---------- SAVE ----------
out_dir = evt_path.parent
res_df.to_csv(out_dir / "v1.2_vs_v1.3_results.csv", index=False)
wide.to_csv(out_dir / "v1.2_vs_v1.3_comparison_table.csv")
if pairs:
    deltas.to_csv(out_dir / "v1.2_vs_v1.3_deltas.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v1.2_vs_v1.3_results.csv")
print(" - v1.2_vs_v1.3_comparison_table.csv")
print(" - v1.2_vs_v1.3_deltas.csv")
Merge audit:
features_file features_sheet event_sheet window day0_features_col ticker_features_col day0_event_col ticker_event_col merged_rows predictor_cols target_col
0 features v1.2.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 8 CAR
1 features v1.2.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 8 CAR
2 features v1.2.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 8 CAR
3 features v1.3.xlsx features CAR_(0,1) 0,1 day0 ticker day0 ticker 129 7 CAR
4 features v1.3.xlsx features CAR_(0,3) 0,3 day0 ticker day0 ticker 129 7 CAR
5 features v1.3.xlsx features CAR_(0,5) 0,5 day0 ticker day0 ticker 129 7 CAR
Results (v1.2 vs v1.3):
rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared features_file features_sheet window
0 129 8 0.245160 0.194838 0.068034 features v1.2.xlsx features 0,1
1 129 7 0.245005 0.201328 0.072892 features v1.3.xlsx features 0,1
2 129 8 0.201481 0.148246 0.094267 features v1.2.xlsx features 0,3
3 129 7 0.201430 0.155231 0.099199 features v1.3.xlsx features 0,3
4 129 8 0.214735 0.162384 0.121771 features v1.2.xlsx features 0,5
5 129 7 0.214615 0.169179 0.133885 features v1.3.xlsx features 0,5
Comparison table (rows = windows | columns = metrics per file):
adjusted_r_squared cross_validated_r_squared r_squared
features_file features v1.2.xlsx features v1.3.xlsx features v1.2.xlsx features v1.3.xlsx features v1.2.xlsx features v1.3.xlsx
window
0,1 0.194838 0.201328 0.068034 0.072892 0.245160 0.245005
0,3 0.148246 0.155231 0.094267 0.099199 0.201481 0.201430
0,5 0.162384 0.169179 0.121771 0.133885 0.214735 0.214615
Deltas (v1.3 minus v1.2) — positive means v1.3 is better:
window delta_cv_r_squared_v13_minus_v12 delta_adjusted_r_squared_v13_minus_v12 delta_r_squared_v13_minus_v12 rows_used_v1.2 rows_used_v1.3 features_used_v1.2 features_used_v1.3
0 0,1 0.004858 0.006490 -0.000155 129 129 8 7
1 0,3 0.004932 0.006985 -0.000051 129 129 8 7
2 0,5 0.012114 0.006795 -0.000120 129 129 8 7
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\1. Data
 - v1.2_vs_v1.3_results.csv
 - v1.2_vs_v1.3_comparison_table.csv
 - v1.2_vs_v1.3_deltas.csv
In [5]:
# === Compare Baseline v1.xlsx vs v1.1.xlsx vs v1.2.xlsx on event_study_2.xlsx ===
# Windows: 0,1  0,3  0,5  0,10  0,15  0,20
# Join on day0 + ticker; grouped CV by ticker; saves CSVs next to the event file.

from pathlib import Path
import re, numpy as np, pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold

# ---------- CONFIG (updated base folder) ----------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
    Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study_2.xlsx"
FEATURE_FILES = ["Baseline v1.xlsx", "v1.1.xlsx", "v1.2.xlsx"]
WINDOWS = ["0,1","0,3","0,5","0,10","0,15","0,20"]
MAX_GROUP_FOLDS = 5

# ---------- HELPERS ----------
def find_file(name: str):
    for b in BASE_DIRS:
        p = b / name
        if p.exists(): return p
    raise FileNotFoundError(f"Could not find: {name}")

def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands: return next(iter(book))
    def score(item):
        _, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_event_window_sheets(book: dict):
    out = {w: None for w in WINDOWS}
    pats = {w: re.compile(rf"(car.*)?0\D*{w.split(',')[1]}(?!\d)", re.IGNORECASE) for w in WINDOWS}
    for nm in book:
        if is_readme_sheet(nm): continue
        for w, pat in pats.items():
            if out[w] is None and pat.search(str(nm)): out[w] = nm
    return out

def find_day0_column(df: pd.DataFrame):
    strict = [c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), flags=re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    best, kbest = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > kbest: best, kbest = c, k
    return best

def find_ticker_column(df: pd.DataFrame):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score: best, score = c, sc
    return best

def find_target_col(df: pd.DataFrame):
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
    return c2[0] if c2 else None

def normalize_day0(s: pd.Series):
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series):
    return s.astype(str).str.strip().str.upper()

def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
    df = df_feat_raw.copy()
    df["__day0__"]   = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
    g = g.dropna(subset=["__day0__","__ticker__"])
    return g, num_cols

def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
    n_groups = int(pd.Series(groups).nunique())
    mdl = LinearRegression()
    scores = []
    if n_groups >= 2:
        gkf = GroupKFold(n_splits=min(max_folds, n_groups))
        splits = gkf.split(X, y, groups=groups)
    else:
        n = len(X)
        if n < 3: return np.nan
        splits = KFold(n_splits=min(3, n), shuffle=True, random_state=42).split(X, y)
    for tr, te in splits:
        mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
        yh = mdl.predict(X.iloc[te].values)
        yt = y.iloc[te].values
        ss_res = ((yt - yh)**2).sum(); ss_tot = ((yt - yt.mean())**2).sum()
        scores.append(1 - ss_res/ss_tot if ss_tot > 0 else np.nan)
    return float(np.nanmean(scores))

def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
    data = pd.concat([y, X], axis=1).dropna()
    y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
    n, p = len(y_c), X_c.shape[1]
    if p == 0 or n < max(10, p+2):
        return dict(rows_used=int(n), features_used=int(p),
                    r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
    mdl = LinearRegression().fit(X_c.values, y_c.values)
    r2 = float(mdl.score(X_c.values, y_c.values))
    adj = 1 - (1 - r2)*(n - 1)/(n - p - 1) if (n - p - 1) > 0 else np.nan
    cv  = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
    return dict(rows_used=int(n), features_used=int(p),
                r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)

# ---------- LOAD EVENT STUDY ----------
evt_path = find_file(EVENT_FILE)
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map  = find_event_window_sheets(evt_book)

# ---------- RUN ----------
merge_audit, results = [], []

for fname in FEATURE_FILES:
    fpath = find_file(fname)
    feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
    fsheet = choose_features_sheet(feat_book)
    feat_raw = feat_book[fsheet].copy()

    dfeat = find_day0_column(feat_raw)
    tfeat = find_ticker_column(feat_raw)
    feat_g, num_cols = aggregate_features(feat_raw, dfeat, tfeat)

    for w in WINDOWS:
        esheet = win_map.get(w)
        if esheet is None:
            print(f"Skip window {w}: no matching sheet in {EVENT_FILE}.")
            continue

        evt_raw = evt_book[esheet].copy()
        devt = find_day0_column(evt_raw)
        tevt = find_ticker_column(evt_raw)
        ycol = find_target_col(evt_raw)

        evt = evt_raw.copy()
        evt["__day0__"]   = normalize_day0(evt[devt])
        evt["__ticker__"] = normalize_ticker(evt[tevt])
        evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])

        merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
        X = build_X(merged, num_cols, ycol)
        y = merged[ycol].astype(float)
        groups = merged["__ticker__"]

        merge_audit.append({
            "features_file": fname, "features_sheet": fsheet,
            "event_sheet": esheet, "window": w,
            "merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
        })

        m = fit_and_score(X, y, groups)
        m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
        results.append(m)

# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)

print("\nMerge audit:")
display(pd.DataFrame(merge_audit))

res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1 vs v1.1 vs v1.2):")
display(res_df)

print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
                          columns="features_file",
                          values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
                          aggfunc="first")
display(wide)

# Per-window winner by cross-validated R^2
winners = (res_df.sort_values(["window","cross_validated_r_squared"], ascending=[True,False])
                 .groupby("window").first().reset_index())
winners = winners[["window","features_file","cross_validated_r_squared","adjusted_r_squared","r_squared","rows_used","features_used"]]
print("\nBest per window (by cross-validated R^2):")
display(winners)

# ---------- SAVE ----------
out_dir = evt_path.parent
res_df.to_csv(out_dir / "v1_v1.1_v1.2_results.csv", index=False)
wide.to_csv(out_dir / "v1_v1.1_v1.2_comparison_table.csv")
winners.to_csv(out_dir / "v1_v1.1_v1.2_best_per_window.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v1_v1.1_v1.2_results.csv")
print(" - v1_v1.1_v1.2_comparison_table.csv")
print(" - v1_v1.1_v1.2_best_per_window.csv")
Merge audit:
features_file features_sheet event_sheet window merged_rows predictor_cols target_col
0 Baseline v1.xlsx features CAR_(0,1) 0,1 129 16 CAR
1 Baseline v1.xlsx features CAR_(0,3) 0,3 129 16 CAR
2 Baseline v1.xlsx features CAR_(0,5) 0,5 129 16 CAR
3 Baseline v1.xlsx features CAR_(0,10) 0,10 129 16 CAR
4 Baseline v1.xlsx features CAR_(0,15) 0,15 129 16 CAR
5 Baseline v1.xlsx features CAR_(0,20) 0,20 129 16 CAR
6 v1.1.xlsx features CAR_(0,1) 0,1 129 8 CAR
7 v1.1.xlsx features CAR_(0,3) 0,3 129 8 CAR
8 v1.1.xlsx features CAR_(0,5) 0,5 129 8 CAR
9 v1.1.xlsx features CAR_(0,10) 0,10 129 8 CAR
10 v1.1.xlsx features CAR_(0,15) 0,15 129 8 CAR
11 v1.1.xlsx features CAR_(0,20) 0,20 129 8 CAR
12 v1.2.xlsx features CAR_(0,1) 0,1 129 7 CAR
13 v1.2.xlsx features CAR_(0,3) 0,3 129 7 CAR
14 v1.2.xlsx features CAR_(0,5) 0,5 129 7 CAR
15 v1.2.xlsx features CAR_(0,10) 0,10 129 7 CAR
16 v1.2.xlsx features CAR_(0,15) 0,15 129 7 CAR
17 v1.2.xlsx features CAR_(0,20) 0,20 129 7 CAR
Results (v1 vs v1.1 vs v1.2):
rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared features_file features_sheet window
0 129 16 0.303485 0.203983 -0.115372 Baseline v1.xlsx features 0,1
1 129 8 0.245160 0.194838 0.068034 v1.1.xlsx features 0,1
2 129 7 0.245005 0.201328 0.072892 v1.2.xlsx features 0,1
3 129 16 0.196103 0.081260 -0.852816 Baseline v1.xlsx features 0,10
4 129 8 0.145079 0.088084 -0.523390 v1.1.xlsx features 0,10
5 129 7 0.144797 0.095323 -0.509257 v1.2.xlsx features 0,10
6 129 16 0.190205 0.074520 -1.004755 Baseline v1.xlsx features 0,15
7 129 8 0.096643 0.036420 -0.489696 v1.1.xlsx features 0,15
8 129 7 0.093859 0.041437 -0.505546 v1.2.xlsx features 0,15
9 129 16 0.390604 0.303547 -0.209408 Baseline v1.xlsx features 0,20
10 129 8 0.201758 0.148542 -0.336828 v1.1.xlsx features 0,20
11 129 7 0.198429 0.152057 -0.318275 v1.2.xlsx features 0,20
12 129 16 0.250824 0.143799 -0.155072 Baseline v1.xlsx features 0,3
13 129 8 0.201481 0.148246 0.094267 v1.1.xlsx features 0,3
14 129 7 0.201430 0.155231 0.099199 v1.2.xlsx features 0,3
15 129 16 0.257400 0.151314 -0.089552 Baseline v1.xlsx features 0,5
16 129 8 0.214735 0.162384 0.121771 v1.1.xlsx features 0,5
17 129 7 0.214615 0.169179 0.133885 v1.2.xlsx features 0,5
Comparison table (rows = windows | columns = metrics per file):
adjusted_r_squared cross_validated_r_squared r_squared
features_file Baseline v1.xlsx v1.1.xlsx v1.2.xlsx Baseline v1.xlsx v1.1.xlsx v1.2.xlsx Baseline v1.xlsx v1.1.xlsx v1.2.xlsx
window
0,1 0.203983 0.194838 0.201328 -0.115372 0.068034 0.072892 0.303485 0.245160 0.245005
0,10 0.081260 0.088084 0.095323 -0.852816 -0.523390 -0.509257 0.196103 0.145079 0.144797
0,15 0.074520 0.036420 0.041437 -1.004755 -0.489696 -0.505546 0.190205 0.096643 0.093859
0,20 0.303547 0.148542 0.152057 -0.209408 -0.336828 -0.318275 0.390604 0.201758 0.198429
0,3 0.143799 0.148246 0.155231 -0.155072 0.094267 0.099199 0.250824 0.201481 0.201430
0,5 0.151314 0.162384 0.169179 -0.089552 0.121771 0.133885 0.257400 0.214735 0.214615
Best per window (by cross-validated R^2):
window features_file cross_validated_r_squared adjusted_r_squared r_squared rows_used features_used
0 0,1 v1.2.xlsx 0.072892 0.201328 0.245005 129 7
1 0,10 v1.2.xlsx -0.509257 0.095323 0.144797 129 7
2 0,15 v1.1.xlsx -0.489696 0.036420 0.096643 129 8
3 0,20 Baseline v1.xlsx -0.209408 0.303547 0.390604 129 16
4 0,3 v1.2.xlsx 0.099199 0.155231 0.201430 129 7
5 0,5 v1.2.xlsx 0.133885 0.169179 0.214615 129 7
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model
 - v1_v1.1_v1.2_results.csv
 - v1_v1.1_v1.2_comparison_table.csv
 - v1_v1.1_v1.2_best_per_window.csv
In [7]:
# === Compare Baseline v1.xlsx vs v1.1.xlsx vs v1.2.xlsx on event_study.xlsx ===
# Windows: 0,1  0,3  0,5
# Join on day0 + ticker; grouped CV by ticker; saves CSVs next to the event file.

from pathlib import Path
import re, numpy as np, pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold

# ---------- CONFIG ----------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
    Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["Baseline v1.xlsx", "v1.1.xlsx", "v1.2.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5

# ---------- HELPERS ----------
def find_file(name: str):
    for b in BASE_DIRS:
        p = b / name
        if p.exists(): return p
    raise FileNotFoundError(f"Could not find: {name}")

def is_readme_sheet(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def choose_features_sheet(book: dict) -> str:
    cands = [(n, df) for n, df in book.items() if not is_readme_sheet(n)]
    if not cands: return next(iter(book))
    def score(item):
        _, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_event_window_sheets(book: dict):
    out = {w: None for w in WINDOWS}
    pats = {
        "0,1": re.compile(r"(car.*)?0\D*1(?!\d)", re.IGNORECASE),
        "0,3": re.compile(r"(car.*)?0\D*3(?!\d)", re.IGNORECASE),
        "0,5": re.compile(r"(car.*)?0\D*5(?!\d)", re.IGNORECASE),
    }
    for nm in book:
        if is_readme_sheet(nm): continue
        for w, pat in pats.items():
            if out[w] is None and pat.search(str(nm)): out[w] = nm
    return out

def find_day0_column(df: pd.DataFrame):
    strict = [c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), flags=re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    best, kbest = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > kbest: best, kbest = c, k
    return best

def find_ticker_column(df: pd.DataFrame):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score: best, score = c, sc
    return best

def find_target_col(df: pd.DataFrame):
    c1 = [c for c in df.columns if re.search(r"\bcar\b", str(c), flags=re.IGNORECASE)]
    if c1: return c1[0]
    c2 = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), flags=re.IGNORECASE)]
    return c2[0] if c2 else None

def normalize_day0(s: pd.Series):
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def normalize_ticker(s: pd.Series):
    return s.astype(str).str.strip().str.upper()

def aggregate_features(df_feat_raw: pd.DataFrame, day0_col: str, ticker_col: str):
    df = df_feat_raw.copy()
    df["__day0__"]   = normalize_day0(df[day0_col])
    df["__ticker__"] = normalize_ticker(df[ticker_col])
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    g = df.groupby(["__day0__","__ticker__"], as_index=False)[num_cols].mean()
    g = g.dropna(subset=["__day0__","__ticker__"])
    return g, num_cols

def build_X(merged: pd.DataFrame, numeric_cols: list, target_col: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].copy()
    X = X.drop(columns=[target_col], errors="ignore")
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def safe_grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
    n_groups = int(pd.Series(groups).nunique())
    mdl = LinearRegression()
    scores = []
    if n_groups >= 2:
        gkf = GroupKFold(n_splits=min(max_folds, n_groups))
        splits = gkf.split(X, y, groups=groups)
    else:
        n = len(X)
        if n < 3: return np.nan
        splits = KFold(n_splits=min(3, n), shuffle=True, random_state=42).split(X, y)
    for tr, te in splits:
        mdl.fit(X.iloc[tr].values, y.iloc[tr].values)
        yh = mdl.predict(X.iloc[te].values)
        yt = y.iloc[te].values
        ss_res = ((yt - yh)**2).sum(); ss_tot = ((yt - yt.mean())**2).sum()
        scores.append(1 - ss_res/ss_tot if ss_tot > 0 else np.nan)
    return float(np.nanmean(scores))

def fit_and_score(X: pd.DataFrame, y: pd.Series, groups: pd.Series):
    data = pd.concat([y, X], axis=1).dropna()
    y_c, X_c = data.iloc[:,0], data.iloc[:,1:]
    n, p = len(y_c), X_c.shape[1]
    if p == 0 or n < max(10, p+2):
        return dict(rows_used=int(n), features_used=int(p),
                    r_squared=np.nan, adjusted_r_squared=np.nan, cross_validated_r_squared=np.nan)
    mdl = LinearRegression().fit(X_c.values, y_c.values)
    r2 = float(mdl.score(X_c.values, y_c.values))
    adj = 1 - (1 - r2)*(n - 1)/(n - p - 1) if (n - p - 1) > 0 else np.nan
    cv  = safe_grouped_cv_r2(X_c, y_c, groups.loc[X_c.index], max_folds=MAX_GROUP_FOLDS)
    return dict(rows_used=int(n), features_used=int(p),
                r_squared=r2, adjusted_r_squared=adj, cross_validated_r_squared=cv)

# ---------- LOAD ----------
evt_path = find_file(EVENT_FILE)
evt_book = pd.read_excel(evt_path, sheet_name=None, engine="openpyxl")
win_map  = find_event_window_sheets(evt_book)

# ---------- RUN ----------
merge_audit, results = [], []

for fname in FEATURE_FILES:
    fpath = find_file(fname)
    feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
    fsheet = choose_features_sheet(feat_book)
    feat_raw = feat_book[fsheet].copy()

    dfeat = find_day0_column(feat_raw)
    tfeat = find_ticker_column(feat_raw)
    feat_g, num_cols = aggregate_features(feat_raw, dfeat, tfeat)

    for w in WINDOWS:
        esheet = win_map.get(w)
        if esheet is None:
            print(f"Skip window {w}: no matching sheet in {EVENT_FILE}.")
            continue

        evt_raw = evt_book[esheet].copy()
        devt = find_day0_column(evt_raw)
        tevt = find_ticker_column(evt_raw)
        ycol = find_target_col(evt_raw)

        evt = evt_raw.copy()
        evt["__day0__"]   = normalize_day0(evt[devt])
        evt["__ticker__"] = normalize_ticker(evt[tevt])
        evt = evt.dropna(subset=["__day0__","__ticker__", ycol]).drop_duplicates(subset=["__day0__","__ticker__"])

        merged = feat_g.merge(evt[["__day0__","__ticker__", ycol]], on=["__day0__","__ticker__"], how="inner")
        X = build_X(merged, num_cols, ycol)
        y = merged[ycol].astype(float)
        groups = merged["__ticker__"]

        merge_audit.append({
            "features_file": fname, "features_sheet": fsheet,
            "event_sheet": esheet, "window": w,
            "merged_rows": len(merged), "predictor_cols": X.shape[1], "target_col": ycol
        })

        m = fit_and_score(X, y, groups)
        m.update(dict(features_file=fname, features_sheet=fsheet, window=w))
        results.append(m)

# ---------- DISPLAY ----------
pd.set_option("display.max_columns", None)

print("\nMerge audit:")
display(pd.DataFrame(merge_audit))

res_df = pd.DataFrame(results).sort_values(["window","features_file"]).reset_index(drop=True)
print("\nResults (v1 vs v1.1 vs v1.2) — windows 0,1 / 0,3 / 0,5:")
display(res_df)

print("\nComparison table (rows = windows | columns = metrics per file):")
wide = res_df.pivot_table(index="window",
                          columns="features_file",
                          values=["r_squared","adjusted_r_squared","cross_validated_r_squared"],
                          aggfunc="first")
display(wide)

# Per-window winner by cross-validated R^2
winners = (res_df.sort_values(["window","cross_validated_r_squared"], ascending=[True,False])
                 .groupby("window").first().reset_index())
winners = winners[["window","features_file","cross_validated_r_squared","adjusted_r_squared","r_squared","rows_used","features_used"]]
print("\nBest per window (by cross-validated R^2):")
display(winners)

# ---------- SAVE ----------
out_dir = evt_path.parent
res_df.to_csv(out_dir / "v1_v1.1_v1.2_results_windows_0_1_0_3_0_5.csv", index=False)
wide.to_csv(out_dir / "v1_v1.1_v1.2_comparison_windows_0_1_0_3_0_5.csv")
winners.to_csv(out_dir / "v1_v1.1_v1.2_best_per_window_0_1_0_3_0_5.csv", index=False)
print(f"\nSaved to: {out_dir}")
print(" - v1_v1.1_v1.2_results_windows_0_1_0_3_0_5.csv")
print(" - v1_v1.1_v1.2_comparison_windows_0_1_0_3_0_5.csv")
print(" - v1_v1.1_v1.2_best_per_window_0_1_0_3_0_5.csv")
Merge audit:
features_file features_sheet event_sheet window merged_rows predictor_cols target_col
0 Baseline v1.xlsx features CAR_(0,1) 0,1 129 16 CAR
1 Baseline v1.xlsx features CAR_(0,3) 0,3 129 16 CAR
2 Baseline v1.xlsx features CAR_(0,5) 0,5 129 16 CAR
3 v1.1.xlsx features CAR_(0,1) 0,1 129 8 CAR
4 v1.1.xlsx features CAR_(0,3) 0,3 129 8 CAR
5 v1.1.xlsx features CAR_(0,5) 0,5 129 8 CAR
6 v1.2.xlsx features CAR_(0,1) 0,1 129 7 CAR
7 v1.2.xlsx features CAR_(0,3) 0,3 129 7 CAR
8 v1.2.xlsx features CAR_(0,5) 0,5 129 7 CAR
Results (v1 vs v1.1 vs v1.2) — windows 0,1 / 0,3 / 0,5:
rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared features_file features_sheet window
0 129 16 0.303485 0.203983 -0.115372 Baseline v1.xlsx features 0,1
1 129 8 0.245160 0.194838 0.068034 v1.1.xlsx features 0,1
2 129 7 0.245005 0.201328 0.072892 v1.2.xlsx features 0,1
3 129 16 0.250824 0.143799 -0.155072 Baseline v1.xlsx features 0,3
4 129 8 0.201481 0.148246 0.094267 v1.1.xlsx features 0,3
5 129 7 0.201430 0.155231 0.099199 v1.2.xlsx features 0,3
6 129 16 0.257400 0.151314 -0.089552 Baseline v1.xlsx features 0,5
7 129 8 0.214735 0.162384 0.121771 v1.1.xlsx features 0,5
8 129 7 0.214615 0.169179 0.133885 v1.2.xlsx features 0,5
Comparison table (rows = windows | columns = metrics per file):
adjusted_r_squared cross_validated_r_squared r_squared
features_file Baseline v1.xlsx v1.1.xlsx v1.2.xlsx Baseline v1.xlsx v1.1.xlsx v1.2.xlsx Baseline v1.xlsx v1.1.xlsx v1.2.xlsx
window
0,1 0.203983 0.194838 0.201328 -0.115372 0.068034 0.072892 0.303485 0.245160 0.245005
0,3 0.143799 0.148246 0.155231 -0.155072 0.094267 0.099199 0.250824 0.201481 0.201430
0,5 0.151314 0.162384 0.169179 -0.089552 0.121771 0.133885 0.257400 0.214735 0.214615
Best per window (by cross-validated R^2):
window features_file cross_validated_r_squared adjusted_r_squared r_squared rows_used features_used
0 0,1 v1.2.xlsx 0.072892 0.201328 0.245005 129 7
1 0,3 v1.2.xlsx 0.099199 0.155231 0.201430 129 7
2 0,5 v1.2.xlsx 0.133885 0.169179 0.214615 129 7
Saved to: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model
 - v1_v1.1_v1.2_results_windows_0_1_0_3_0_5.csv
 - v1_v1.1_v1.2_comparison_windows_0_1_0_3_0_5.csv
 - v1_v1.1_v1.2_best_per_window_0_1_0_3_0_5.csv
In [9]:
# === Visualise v1 vs v1.1 vs v1.2 on windows 0,1 / 0,3 / 0,5 ===
# Requires: pandas, numpy, scikit-learn, openpyxl, matplotlib
# pip install pandas numpy scikit-learn openpyxl matplotlib

from pathlib import Path
import re, numpy as np, pandas as pd, matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold

# ----------------- CONFIG -----------------
BASE_DIRS = [Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
             Path("."), Path("/mnt/data")]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = {
    "Baseline v1.xlsx": "v1",
    "v1.1.xlsx": "v1.1",
    "v1.2.xlsx": "v1.2",
}
WINDOWS = ["0,1", "0,3", "0,5"]
MAX_GROUP_FOLDS = 5

# colours for models (distinct)
MODEL_COLOURS = {"v1":"#1f77b4", "v1.1":"#ff7f0e", "v1.2":"#2ca02c"}

# ----------------- HELPERS -----------------
def find_file(name):
    for b in BASE_DIRS:
        p = b / name
        if p.exists(): return p
    raise FileNotFoundError(f"Could not find: {name}")

def is_readme(name): 
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))

def choose_features_sheet(book):
    cands = [(n, df) for n, df in book.items() if not is_readme(n)]
    if not cands: return next(iter(book))
    def score(x):
        n, df = x
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def window_sheets(book):
    out = {"0,1":None,"0,3":None,"0,5":None}
    pats = {"0,1":r"(car.*)?0\D*1(?!\d)","0,3":r"(car.*)?0\D*3(?!\d)","0,5":r"(car.*)?0\D*5(?!\d)"}
    for nm in book:
        if is_readme(nm): continue
        for w,pat in pats.items():
            if out[w] is None and re.search(pat, str(nm), re.I): out[w]=nm
    return out

def find_day0(df):
    strict = [c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.I)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    best,k=None,-1
    for c in df.columns:
        kk = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if kk>k: best,k=c,kk
    return best

def find_ticker(df):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best,score=None,-1
    for c in obj:
        s=df[c].astype(str).str.strip()
        sc=s.nunique() - 0.1*s.str.len().mean()
        if sc>score: best,score=c,sc
    return best

def find_target(df):
    c = [c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
    if c: return c[0]
    c = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
    return c[0] if c else None

def norm_day0(s):
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def norm_ticker(s): return s.astype(str).str.strip().str.upper()

def group_numeric(df, dcol, tcol):
    g = df.copy()
    g["__day0__"] = norm_day0(g[dcol]); g["__tic__"] = norm_ticker(g[tcol])
    nums = g.select_dtypes(include=[np.number]).columns.tolist()
    g = (g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
           .dropna(subset=["__day0__","__tic__"]))
    return g, nums

def build_X(merged, cols, ycol):
    keep=[c for c in cols if c in merged.columns]
    X = merged.loc[:, keep].drop(columns=[ycol], errors="ignore")
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq>1]

def cv_r2_and_oof_preds(X, y, groups):
    n_groups = int(pd.Series(groups).nunique())
    if len(X)<3: return np.nan, np.full(len(y), np.nan)
    if n_groups >= 2:
        splitter = GroupKFold(n_splits=min(MAX_GROUP_FOLDS, n_groups))
        splits = splitter.split(X, y, groups=groups)
    else:
        splitter = KFold(n_splits=min(3, len(X)), shuffle=True, random_state=42)
        splits = splitter.split(X, y)
    oof = np.full(len(y), np.nan)
    scores=[]
    for tr,te in splits:
        m = LinearRegression().fit(X.iloc[tr].values, y.iloc[tr].values)
        yh = m.predict(X.iloc[te].values)
        yt = y.iloc[te].values
        oof[te] = yh
        ss_res = np.sum((yt - yh)**2)
        ss_tot = np.sum((yt - yt.mean())**2)
        scores.append(1 - ss_res/ss_tot if ss_tot>0 else np.nan)
    return float(np.nanmean(scores)), oof

def insample_r2_adj(X, y):
    m = LinearRegression().fit(X.values, y.values)
    r2 = float(m.score(X.values, y.values))
    n,p = len(y), X.shape[1]
    adj = 1 - (1 - r2)*(n - 1)/(n - p - 1) if (n - p - 1)>0 else np.nan
    return r2, adj

# ----------------- LOAD -----------------
evt_book = pd.read_excel(find_file(EVENT_FILE), sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)

all_rows=[]
oof_store={}   # (model, window) -> (y_true, y_oof)

for ffile, tag in [(find_file(k), v) for k,v in FEATURE_FILES.items()]:
    f_book = pd.read_excel(ffile, sheet_name=None, engine="openpyxl")
    f_sheet = choose_features_sheet(f_book)
    raw = f_book[f_sheet].copy()

    dcol, tcol = find_day0(raw), find_ticker(raw)
    feat_g, feat_cols = group_numeric(raw, dcol, tcol)

    for w in WINDOWS:
        es = win_map[w]
        ev = evt_book[es].copy()
        ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
        ev["__day0__"] = norm_day0(ev[ed]); ev["__tic__"] = norm_ticker(ev[et])
        ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])

        merged = feat_g.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner")
        X = build_X(merged, feat_cols, ycol)
        y = merged[ycol].astype(float)
        groups = merged["__tic__"]

        if len(X)==0:
            all_rows.append({"model":tag,"window":w,"rows_used":0,"features_used":0,
                             "r_squared":np.nan,"adjusted_r_squared":np.nan,"cross_validated_r_squared":np.nan})
            continue

        r2, adj = insample_r2_adj(X, y)
        cv, oof = cv_r2_and_oof_preds(X, y, groups)

        all_rows.append({"model":tag,"window":w,"rows_used":len(X),"features_used":X.shape[1],
                         "r_squared":r2,"adjusted_r_squared":adj,"cross_validated_r_squared":cv})

        oof_store[(tag,w)] = (y.values, oof)

res = pd.DataFrame(all_rows).sort_values(["window","model"]).reset_index(drop=True)

# Save results
out_dir = find_file(EVENT_FILE).parent
res.to_csv(out_dir/"viz_v1_v1.1_v1.2_results.csv", index=False)

# ----------------- PLOTS -----------------
# Bar charts: cross-validated coefficient of determination and adjusted coefficient of determination
for metric, title in [("cross_validated_r_squared","Cross-validated R^2"),
                      ("adjusted_r_squared","Adjusted R^2")]:
    fig = plt.figure(figsize=(8,5))
    idx = np.arange(len(WINDOWS))
    width = 0.22
    offsets = {"v1":-width, "v1.1":0.0, "v1.2":width}
    for model in ["v1","v1.1","v1.2"]:
        vals = [float(res[(res.window==w)&(res.model==model)][metric]) for w in WINDOWS]
        plt.bar(idx + offsets[model], vals, width, label=model, color=MODEL_COLOURS[model])
    plt.xticks(idx, WINDOWS)
    plt.ylabel(title)
    plt.title(f"{title} — v1 vs v1.1 vs v1.2")
    plt.legend()
    plt.tight_layout()
    fig.savefig(out_dir/f"{metric}_bars_v1_v11_v12.png", dpi=150)
    plt.show()

# Line graph: coefficient of determination across windows
fig = plt.figure(figsize=(8,5))
for model in ["v1","v1.1","v1.2"]:
    vals = [float(res[(res.window==w)&(res.model==model)]["r_squared"]) for w in WINDOWS]
    plt.plot(WINDOWS, vals, marker="o", label=model, color=MODEL_COLOURS[model])
plt.ylabel("R^2")
plt.title("R^2 across windows — v1 vs v1.1 vs v1.2")
plt.legend()
plt.tight_layout()
fig.savefig(out_dir/"r2_lines_v1_v11_v12.png", dpi=150)
plt.show()

# Scatter plots with line of best fit (out-of-fold predictions) for each model and window
def scatter_with_fit(y_true, y_pred, title, save_path):
    fig = plt.figure(figsize=(5,5))
    plt.scatter(y_true, y_pred, s=18, alpha=0.7)
    # best fit line (y_pred on y_true)
    ok = np.isfinite(y_true) & np.isfinite(y_pred)
    if ok.sum() >= 2:
        a,b = np.polyfit(y_true[ok], y_pred[ok], 1)
        xs = np.linspace(np.nanmin(y_true[ok]), np.nanmax(y_true[ok]), 100)
        plt.plot(xs, a*xs + b, linestyle="--")
    # 45-degree reference
    lim = np.nanmax(np.abs(np.concatenate([y_true[ok], y_pred[ok]]))) if ok.any() else 1.0
    lim = float(lim)*1.05
    plt.plot([-lim, lim], [-lim, lim], linestyle=":")
    plt.xlim(-lim, lim); plt.ylim(-lim, lim)
    plt.xlabel("Actual CAR")
    plt.ylabel("Predicted CAR (OOF)")
    plt.title(title)
    plt.tight_layout()
    fig.savefig(save_path, dpi=150)
    plt.show()

for model in ["v1","v1.1","v1.2"]:
    for w in WINDOWS:
        if (model, w) in oof_store:
            y_true, y_oof = oof_store[(model,w)]
            scatter_with_fit(y_true, y_oof,
                             title=f"{model} — window {w} (OOF)",
                             save_path=out_dir/f"scatter_oof_{model}_window_{w.replace(',','_')}.png")

print(f"Saved figures and CSV in: {out_dir}")
C:\Users\dcazo\AppData\Local\Temp\ipykernel_22912\4005354297.py:185: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  vals = [float(res[(res.window==w)&(res.model==model)][metric]) for w in WINDOWS]
No description has been provided for this image
C:\Users\dcazo\AppData\Local\Temp\ipykernel_22912\4005354297.py:185: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  vals = [float(res[(res.window==w)&(res.model==model)][metric]) for w in WINDOWS]
No description has been provided for this image
C:\Users\dcazo\AppData\Local\Temp\ipykernel_22912\4005354297.py:198: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  vals = [float(res[(res.window==w)&(res.model==model)]["r_squared"]) for w in WINDOWS]
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
Saved figures and CSV in: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model
In [2]:
# === Feature importance for v1.2 (7 features) with grouped cross validation ===
# Tests: permutation drop, leave-one-feature-out drop, mean abs standardized coefficient
# Join on day0 + ticker; no engineered event columns as predictors
# Saves a CSV per window and prints a sorted table
#
# pip install pandas numpy scikit-learn openpyxl matplotlib

from pathlib import Path
import re, numpy as np, pandas as pd
from sklearn.model_selection import GroupKFold, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# ---------------- CONFIG ----------------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
    Path("."), Path("/mnt/data")
]
EVENT_FILE = "event_study.xlsx"
FEATURES_FILE = "v1.2.xlsx"   # your baseline file with the 7 features
WINDOWS_TO_SCORE = ["0,5"]    # change to ["0,1","0,3","0,5"] if you want all
MAX_GROUP_FOLDS = 5
FORCE_FEATURES = None         # put a list here if you want to force exactly 7 feature names

# -------------- HELPERS --------------
def find_file(name):
    for b in BASE_DIRS:
        p = b / name
        if p.exists(): return p
    raise FileNotFoundError(name)

def is_readme(name): 
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))

def choose_features_sheet(book):
    cands = [(n, df) for n, df in book.items() if not is_readme(n)]
    if not cands: return next(iter(book))
    def score(x):
        _, df = x
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def window_sheets(book):
    out = {"0,1":None,"0,3":None,"0,5":None}
    pats = {"0,1":r"(car.*)?0\D*1(?!\d)","0,3":r"(car.*)?0\D*3(?!\d)","0,5":r"(car.*)?0\D*5(?!\d)"}
    for nm in book:
        if is_readme(nm): continue
        for w,pat in pats.items():
            if out[w] is None and re.search(pat, str(nm), re.I): out[w]=nm
    return out

def find_day0(df):
    s=[c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.I)]
    if s: return s[0]
    for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
              "date","Date","trading_date","TradingDate","date0","Date0","DATE0"]:
        if c in df.columns: return c
    # fallback: most date-like
    best,k=None,-1
    for c in df.columns:
        kk = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if kk>k: best,k=c,kk
    return best

def find_ticker(df):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best,score=None,-1
    for c in obj:
        s=df[c].astype(str).str.strip()
        sc=s.nunique() - 0.1*s.str.len().mean()
        if sc>score: best,score=c,sc
    return best

def find_target(df):
    c=[c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
    if c: return c[0]
    c=[c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
    return c[0] if c else None

def norm_day0(s):
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def norm_tic(s): return s.astype(str).str.strip().str.upper()

def group_numeric(df, dcol, tcol):
    g=df.copy()
    g["__day0__"]=norm_day0(g[dcol]); g["__tic__"]=norm_tic(g[tcol])
    nums=g.select_dtypes(include=[np.number]).columns.tolist()
    g=(g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
         .dropna(subset=["__day0__","__tic__"]))
    return g, nums

def build_X(merged, cols, ycol):
    keep=[c for c in cols if c in merged.columns]
    X=merged.loc[:, keep].drop(columns=[ycol], errors="ignore")
    nunq=X.nunique(dropna=False)
    return X.loc[:, nunq>1]

def grouped_splits(X, y, groups):
    ng = int(pd.Series(groups).nunique())
    if ng>=2:
        gkf = GroupKFold(n_splits=min(MAX_GROUP_FOLDS, ng))
        return list(gkf.split(X, y, groups=groups))
    # fallback when groups are too few
    k = min(3, len(X))
    return list(KFold(n_splits=k, shuffle=True, random_state=42).split(X, y))

def fit_and_score(Xtr, ytr, Xte, yte):
    # standardise in train only
    scaler = StandardScaler()
    Xtr_s = scaler.fit_transform(Xtr.values)
    Xte_s = scaler.transform(Xte.values)
    m = LinearRegression().fit(Xtr_s, ytr.values)
    # test coefficient of determination
    yh = m.predict(Xte_s)
    ss_res = ((yte.values - yh)**2).sum()
    ss_tot = ((yte.values - yte.values.mean())**2).sum()
    r2 = 1 - ss_res/ss_tot if ss_tot>0 else np.nan
    return r2, m.coef_, scaler

# -------------- LOAD --------------
evt_book = pd.read_excel(find_file(EVENT_FILE), sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)

feat_book = pd.read_excel(find_file(FEATURES_FILE), sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
feat_raw = feat_book[fsheet].copy()

dcol, tcol = find_day0(feat_raw), find_ticker(feat_raw)
feat_g, feat_cols_all = group_numeric(feat_raw, dcol, tcol)

# If you want to force exactly seven features, list them in FORCE_FEATURES
if FORCE_FEATURES:
    feat_cols = [c for c in FORCE_FEATURES if c in feat_g.columns]
else:
    # use all numeric columns in v1.2 after dropping the target later
    feat_cols = feat_cols_all

print("Detected feature candidates:", feat_cols)

results_all = []

for w in WINDOWS_TO_SCORE:
    es = win_map[w]
    ev = evt_book[es].copy()
    ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
    ev["__day0__"]=norm_day0(ev[ed]); ev["__tic__"]=norm_tic(ev[et])
    ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])

    merged = feat_g.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner")
    X = build_X(merged, feat_cols, ycol)
    y = merged[ycol].astype(float)
    groups = merged["__tic__"]

    # If more than seven made it through, keep the seven with most variance
    if X.shape[1] > 7:
        var_rank = X.var().sort_values(ascending=False).index.tolist()
        X = X[var_rank[:7]]
    feature_list = X.columns.tolist()

    # Cross validated baseline and fold objects
    splits = grouped_splits(X, y, groups)

    # 1) Permutation importance (drop in test coefficient of determination when permuted on test)
    perm_drops = {f: [] for f in feature_list}
    # 2) Standardized coefficients (mean absolute across folds)
    coef_collection = {f: [] for f in feature_list}

    # Compute baseline per fold and permutation drops
    for tr, te in splits:
        r2_base, coef, scaler = fit_and_score(X.iloc[tr], y.iloc[tr], X.iloc[te], y.iloc[te])
        # map coefficients back to feature names (after scaling)
        for f, c in zip(feature_list, coef):
            coef_collection[f].append(abs(float(c)))

        # permutation on the test slice only
        Xte = X.iloc[te].copy()
        for f in feature_list:
            Xperm = Xte.copy()
            Xperm[f] = np.random.permutation(Xperm[f].values)  # break the link
            # reuse the same scaler and model? No. Refit on train only to keep it honest.
            r2_perm, _, _ = fit_and_score(X.iloc[tr], y.iloc[tr], Xperm, y.iloc[te])
            drop = (r2_base - r2_perm) if (r2_base is not np.nan and r2_perm is not np.nan) else np.nan
            perm_drops[f].append(drop)

    perm_mean = {f: float(np.nanmean(v)) for f, v in perm_drops.items()}
    coef_mean = {f: float(np.nanmean(v)) for f, v in coef_collection.items()}

    # 3) Leave-one-feature-out cross validated coefficient of determination drop
    # baseline cross validated coefficient of determination with all features
    def cv_r2(Xfull):
        scores=[]
        for tr, te in splits:
            r2, _, _ = fit_and_score(Xfull.iloc[tr], y.iloc[tr], Xfull.iloc[te], y.iloc[te])
            scores.append(r2)
        return float(np.nanmean(scores))

    base_cv = cv_r2(X)

    lofo_drop = {}
    for f in feature_list:
        X_minus = X.drop(columns=[f])
        lofo_drop[f] = base_cv - cv_r2(X_minus)

    # Build importance table
    imp = pd.DataFrame({
        "feature": feature_list,
        "permutation_drop_in_test_coefficient_of_determination": [perm_mean[f] for f in feature_list],
        "leave_one_out_drop_in_cross_validated_coefficient_of_determination": [lofo_drop[f] for f in feature_list],
        "mean_abs_standardized_coefficient": [coef_mean[f] for f in feature_list],
    })

    # Ranks (1 = most important)
    for col in ["permutation_drop_in_test_coefficient_of_determination",
                "leave_one_out_drop_in_cross_validated_coefficient_of_determination",
                "mean_abs_standardized_coefficient"]:
        imp[f"rank_{col}"] = imp[col].rank(ascending=False, method="min")

    imp["aggregate_rank"] = imp[[c for c in imp.columns if c.startswith("rank_")]].mean(axis=1)
    imp = imp.sort_values("aggregate_rank").reset_index(drop=True)

    print(f"\nWindow {w} — baseline cross validated coefficient of determination with all seven: {base_cv:.4f}")
    display(imp)

    out_path = find_file(EVENT_FILE).parent / f"v12_feature_importance_window_{w.replace(',','_')}.csv"
    imp.to_csv(out_path, index=False)
    print("Saved:", out_path)
Detected feature candidates: ['eps_surprise_pct', 'pre_ret_3d', 'pre_vol_5d', 'mkt_ret_5d_lag1', 'macro_us10y', 'vix_level_lag1', 'vix_chg_5d_lag1']

Window 0,5 — baseline cross validated coefficient of determination with all seven: 0.1339
feature permutation_drop_in_test_coefficient_of_determination leave_one_out_drop_in_cross_validated_coefficient_of_determination mean_abs_standardized_coefficient rank_permutation_drop_in_test_coefficient_of_determination rank_leave_one_out_drop_in_cross_validated_coefficient_of_determination rank_mean_abs_standardized_coefficient aggregate_rank
0 pre_ret_3d 0.147469 0.149779 0.022378 1.0 1.0 1.0 1.000000
1 eps_surprise_pct 0.100898 0.045600 0.017527 2.0 2.0 2.0 2.000000
2 vix_chg_5d_lag1 0.072175 0.015763 0.017332 3.0 5.0 3.0 3.666667
3 macro_us10y 0.026367 0.025868 0.010407 5.0 3.0 4.0 4.000000
4 vix_level_lag1 0.043007 0.021924 0.007735 4.0 4.0 5.0 4.333333
5 mkt_ret_5d_lag1 0.021462 -0.003507 0.003803 6.0 7.0 6.0 6.333333
6 pre_vol_5d 0.010645 0.000146 0.003343 7.0 6.0 7.0 6.666667
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\v12_feature_importance_window_0_5.csv
In [1]:
# ===== Test features_new.csv against event_study.xlsx on windows 0,1 / 0,3 / 0,5 =====
# Requirements: pandas, numpy, scikit-learn, openpyxl, matplotlib

import re, numpy as np, pandas as pd
from pathlib import Path
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
import matplotlib.pyplot as plt

# ---- Paths (edit if your files live elsewhere) ----
DATA_DIR = Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model")
EVENT_FILE = DATA_DIR / "event_study.xlsx"
FEATURES_FILE = DATA_DIR / "features_new.csv"   # new file you just shared

WINDOWS = ["0,1","0,3","0,5"]   # test these three
MAX_GROUP_FOLDS = 5

# ---- Helpers ----
def is_readme(name: str) -> bool:
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), flags=re.IGNORECASE))

def window_sheets(book: dict):
    out = {"0,1":None,"0,3":None,"0,5":None}
    pats = {"0,1":r"(car.*)?0\D*1(?!\d)", "0,3":r"(car.*)?0\D*3(?!\d)", "0,5":r"(car.*)?0\D*5(?!\d)"}
    for nm in book:
        if is_readme(nm): 
            continue
        for w, pat in pats.items():
            if out[w] is None and re.search(pat, str(nm), re.IGNORECASE):
                out[w] = nm
    return out

def find_day0(df: pd.DataFrame) -> str:
    strict = [c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.IGNORECASE)]
    if strict: return strict[0]
    for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
              "date","Date","trading_date","TradingDate","date0","Date0","DATE0"]:
        if c in df.columns: return c
    best, kbest = None, -1
    for c in df.columns:
        k = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if k > kbest: best, kbest = c, k
    return best

def find_ticker(df: pd.DataFrame) -> str:
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best, score = None, -1
    for c in obj:
        s = df[c].astype(str).str.strip()
        sc = s.nunique() - 0.1*s.str.len().mean()
        if sc > score: best, score = c, sc
    return best

def find_target(df: pd.DataFrame) -> str:
    c = [c for c in df.columns if re.search(r"\bcar\b", str(c), re.IGNORECASE)]
    if c: return c[0]
    c = [c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.IGNORECASE)]
    return c[0] if c else None

def norm_day0(s: pd.Series):
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def norm_tic(s: pd.Series): 
    return s.astype(str).str.strip().str.upper()

def group_numeric_by_day0_tic(df: pd.DataFrame, dcol: str, tcol: str):
    g = df.copy()
    g["__day0__"] = norm_day0(g[dcol])
    g["__tic__"]  = norm_tic(g[tcol])
    nums = g.select_dtypes(include=[np.number]).columns.tolist()
    g = (g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
          .dropna(subset=["__day0__","__tic__"]))
    return g, nums

def build_X(merged: pd.DataFrame, numeric_cols: list, ycol: str):
    keep = [c for c in numeric_cols if c in merged.columns]
    X = merged.loc[:, keep].drop(columns=[ycol], errors="ignore")
    # drop constants
    nunq = X.nunique(dropna=False)
    return X.loc[:, nunq > 1]

def adjusted_r2(X: pd.DataFrame, y: pd.Series, r2_value: float):
    n, p = len(y), X.shape[1]
    if n - p - 1 <= 0: 
        return np.nan
    return 1 - (1 - r2_value) * (n - 1) / (n - p - 1)

def grouped_cv_r2(X: pd.DataFrame, y: pd.Series, groups: pd.Series, max_folds=5):
    n_groups = int(pd.Series(groups).nunique())
    if len(X) < 3:
        return np.nan
    if n_groups >= 2:
        gkf = GroupKFold(n_splits=min(max_folds, n_groups))
        splits = gkf.split(X, y, groups=groups)
    else:
        kf = KFold(n_splits=min(3, len(X)), shuffle=True, random_state=42)
        splits = kf.split(X, y)
    scores = []
    for tr, te in splits:
        m = LinearRegression().fit(X.iloc[tr].values, y.iloc[tr].values)
        yh = m.predict(X.iloc[te].values)
        yt = y.iloc[te].values
        ss_res = np.sum((yt - yh)**2)
        ss_tot = np.sum((yt - yt.mean())**2)
        scores.append(1 - ss_res/ss_tot if ss_tot>0 else np.nan)
    return float(np.nanmean(scores))

# ---- Load data ----
evt_book = pd.read_excel(EVENT_FILE, sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)

# features_new.csv can have many columns; we will detect day0 and ticker first
feat_raw = pd.read_csv(FEATURES_FILE)
dcol, tcol = find_day0(feat_raw), find_ticker(feat_raw)
features_grouped, numeric_cols = group_numeric_by_day0_tic(feat_raw, dcol, tcol)

# ---- Score per window ----
rows = []
for w in WINDOWS:
    es = win_map[w]
    if es is None:
        print(f"Could not find a sheet for window {w}. Skipping.")
        continue

    ev = evt_book[es].copy()
    ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
    ev["__day0__"] = norm_day0(ev[ed])
    ev["__tic__"]  = norm_tic(ev[et])
    ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])

    merged = features_grouped.merge(ev[["__day0__","__tic__", ycol]],
                                    on=["__day0__","__tic__"], how="inner")

    X = build_X(merged, numeric_cols, ycol)
    y = merged[ycol].astype(float)
    groups = merged["__tic__"]

    if X.shape[1] == 0 or len(y) < max(10, X.shape[1] + 2):
        rows.append({"model":"features_new.csv","window":w,"rows_used":len(y),
                     "features_used":X.shape[1],
                     "r_squared":np.nan,
                     "adjusted_r_squared":np.nan,
                     "cross_validated_r_squared":np.nan})
        continue

    # In-sample
    lr = LinearRegression().fit(X.values, y.values)
    r2_in = float(lr.score(X.values, y.values))
    adj_in = float(adjusted_r2(X, y, r2_in))

    # Out-of-sample (grouped)
    cv_out = grouped_cv_r2(X, y, groups, MAX_GROUP_FOLDS)

    rows.append({
        "model":"features_new.csv",
        "window": w,
        "rows_used": len(y),
        "features_used": X.shape[1],
        "r_squared": r2_in,
        "adjusted_r_squared": adj_in,
        "cross_validated_r_squared": cv_out
    })

results = pd.DataFrame(rows)
display(results)

# Save
out_csv = DATA_DIR / "features_new_metrics.csv"
results.to_csv(out_csv, index=False)
print("Saved:", out_csv)

# ---- Quick bars ----
def make_bar(metric, title):
    fig = plt.figure(figsize=(8,5))
    vals = [float(results.loc[results.window==w, metric]) if (results.window==w).any() else np.nan for w in WINDOWS]
    plt.bar(range(len(WINDOWS)), vals)
    plt.xticks(range(len(WINDOWS)), WINDOWS)
    plt.ylabel(title)
    plt.title(f"{title} — features_new.csv")
    plt.tight_layout()
    plt.show()

make_bar("adjusted_r_squared", "Adjusted coefficient of determination")
make_bar("cross_validated_r_squared", "Cross validated coefficient of determination")
C:\Users\dcazo\AppData\Local\Temp\ipykernel_4324\3626808198.py:63: UserWarning: Parsing dates in %d/%m/%Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning.
  a = pd.to_datetime(s, errors="coerce").dt.normalize()
model window rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared
0 features_new.csv 0,1 129 27 0.832782 0.788080 0.580178
1 features_new.csv 0,3 129 27 0.743844 0.675366 0.469311
2 features_new.csv 0,5 129 27 0.705773 0.627118 0.375565
C:\Users\dcazo\AppData\Local\Temp\ipykernel_4324\3626808198.py:179: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  vals = [float(results.loc[results.window==w, metric]) if (results.window==w).any() else np.nan for w in WINDOWS]
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\features_new_metrics.csv
No description has been provided for this image
C:\Users\dcazo\AppData\Local\Temp\ipykernel_4324\3626808198.py:179: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  vals = [float(results.loc[results.window==w, metric]) if (results.window==w).any() else np.nan for w in WINDOWS]
No description has been provided for this image
In [3]:
# === Which features push cross validated R squared up or down (27-feature set) ===
# Paths
from pathlib import Path
DATA_DIR = Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model")
EVENT_FILE = DATA_DIR / "event_study.xlsx"
FEATURES_FILE = DATA_DIR / "features_new.csv"   # your file with ~27 features
WINDOWS = ["0,5"]  # change to ["0,1","0,3","0,5"] if you want all

# ------------------- Imports -------------------
import re, numpy as np, pandas as pd
from sklearn.model_selection import GroupKFold, KFold
from sklearn.linear_model import LinearRegression

# ------------------- Small helpers -------------------
def is_readme(name): 
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))

def window_sheets(book):
    out = {"0,1":None,"0,3":None,"0,5":None}
    pats = {"0,1":r"(car.*)?0\D*1(?!\d)","0,3":r"(car.*)?0\D*3(?!\d)","0,5":r"(car.*)?0\D*5(?!\d)"}
    for nm in book:
        if is_readme(nm): continue
        for w,pat in pats.items():
            if out[w] is None and re.search(pat, str(nm), re.I): out[w]=nm
    return out

def find_day0(df):
    s=[c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.I)]
    if s: return s[0]
    for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
              "date","Date","trading_date","TradingDate","date0","Date0","DATE0"]:
        if c in df.columns: return c
    best,k=None,-1
    for c in df.columns:
        kk = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if kk>k: best,k=c,kk
    return best

def find_ticker(df):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best,score=None,-1
    for c in obj:
        s=df[c].astype(str).str.strip()
        sc=s.nunique() - 0.1*s.str.len().mean()
        if sc>score: best,score=c,sc
    return best

def find_target(df):
    c=[c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
    if c: return c[0]
    c=[c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
    return c[0] if c else None

def norm_day0(s):
    a=pd.to_datetime(s, errors="coerce").dt.normalize()
    b=pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def norm_tic(s): 
    return s.astype(str).str.strip().str.upper()

def group_numeric_by_keys(df, dcol, tcol):
    g=df.copy()
    g["__day0__"]=norm_day0(g[dcol]); g["__tic__"]=norm_tic(g[tcol])
    nums=g.select_dtypes(include=[np.number]).columns.tolist()
    g=(g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
         .dropna(subset=["__day0__","__tic__"]))
    return g, nums

def adjusted_r2(n, p, r2):
    return np.nan if n-p-1<=0 else 1 - (1-r2)*(n-1)/(n-p-1)

# Winsorise and standardise inside train only
def fit_transformers(Xtr):
    stats={}
    Xw=Xtr.copy()
    for c in Xw.columns:
        lo, hi = np.nanpercentile(Xw[c].values, [1,99])
        stats[c] = {"lo":float(lo), "hi":float(hi),
                    "mean":float(np.nanmean(Xw[c].clip(lo,hi))),
                    "std": float(np.nanstd(Xw[c].clip(lo,hi), ddof=0))}
        Xw[c] = Xw[c].clip(lo,hi)
        sd = stats[c]["std"] or 1.0
        Xw[c] = (Xw[c] - stats[c]["mean"]) / sd
    return stats, Xw

def apply_transformers(Xte, stats):
    Xw=Xte.copy()
    for c in Xw.columns:
        if c not in stats: continue
        lo,hi,mu,sd = stats[c]["lo"], stats[c]["hi"], stats[c]["mean"], stats[c]["std"] or 1.0
        Xw[c]=((Xw[c].clip(lo,hi) - mu) / sd)
    return Xw

def grouped_splits(X, y, groups, max_folds=5):
    ng=int(pd.Series(groups).nunique())
    if ng>=2:
        return list(GroupKFold(n_splits=min(max_folds, ng)).split(X,y,groups))
    k=min(3, len(X))
    return list(KFold(n_splits=k, shuffle=True, random_state=42).split(X,y))

def fold_score(Xtr, ytr, Xte, yte):
    stats, Xtr_s = fit_transformers(Xtr)
    Xte_s = apply_transformers(Xte, stats)
    m=LinearRegression().fit(Xtr_s.values, ytr.values)
    pred=m.predict(Xte_s.values)
    ss_res=np.sum((yte.values - pred)**2)
    ss_tot=np.sum((yte.values - yte.values.mean())**2)
    return (1 - ss_res/ss_tot) if ss_tot>0 else np.nan, m, stats

def cv_r2(X, y, groups, splits):
    scores=[]
    for tr,te in splits:
        r2,_,_ = fold_score(X.iloc[tr], y.iloc[tr], X.iloc[te], y.iloc[te])
        scores.append(r2)
    return float(np.nanmean(scores))

# ------------------- Load -------------------
evt_book = pd.read_excel(EVENT_FILE, sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)

feat_raw = pd.read_csv(FEATURES_FILE)
dcol, tcol = find_day0(feat_raw), find_ticker(feat_raw)
feat_g, numeric_cols = group_numeric_by_keys(feat_raw, dcol, tcol)

# ------------------- Run per window -------------------
all_summaries = []

for w in WINDOWS:
    es = win_map[w]
    if es is None:
        print(f"Window {w}: not found in event_study.xlsx; skipping.")
        continue

    ev = evt_book[es].copy()
    ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
    ev["__day0__"]=norm_day0(ev[ed]); ev["__tic__"]=norm_tic(ev[et])
    ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])

    merged = feat_g.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner")

    # Build X and y
    X = merged[numeric_cols].drop(columns=[c for c in ["CAR","car",ycol] if c in numeric_cols], errors="ignore")
    # Drop constant columns
    nunq = X.nunique(dropna=False)
    X = X.loc[:, nunq > 1]
    y = merged[ycol].astype(float)
    groups = merged["__tic__"]

    # Cross validated baseline
    splits = grouped_splits(X, y, groups, max_folds=5)
    base_cv = cv_r2(X, y, groups, splits)

    # Leave-one-feature-out delta (positive = helpful; negative = harmful)
    lofo = {}
    for f in X.columns:
        cv_without = cv_r2(X.drop(columns=[f]), y, groups, splits)
        lofo[f] = base_cv - cv_without

    # Permutation drop in test per fold (bigger drop = more important)
    perm = {f: [] for f in X.columns}
    for tr, te in splits:
        r2_base, m, stats = fold_score(X.iloc[tr], y.iloc[tr], X.iloc[te], y.iloc[te])
        if np.isnan(r2_base): 
            for f in X.columns: perm[f].append(np.nan)
            continue
        Xte = X.iloc[te].copy()
        for f in X.columns:
            Xperm = Xte.copy()
            Xperm[f] = np.random.permutation(Xperm[f].values)
            # score again using same training split
            r2_perm, _, _ = fold_score(X.iloc[tr], y.iloc[tr], Xperm, y.iloc[te])
            perm[f].append(r2_base - r2_perm)

    perm_mean = {f: float(np.nanmean(v)) for f,v in perm.items()}

    imp = pd.DataFrame({
        "feature": X.columns,
        "delta_cv_r_squared_when_dropped": [lofo[f] for f in X.columns],
        "permutation_drop_in_test_r_squared": [perm_mean[f] for f in X.columns]
    })
    # Rankings (1 = most helpful/important)
    imp["rank_lofo"] = imp["delta_cv_r_squared_when_dropped"].rank(ascending=False, method="min")
    imp["rank_perm"] = imp["permutation_drop_in_test_r_squared"].rank(ascending=False, method="min")
    imp["aggregate_rank"] = imp[["rank_lofo","rank_perm"]].mean(axis=1)
    imp = imp.sort_values("aggregate_rank").reset_index(drop=True)

    # Labels for action
    imp["action"] = np.where(
        imp["delta_cv_r_squared_when_dropped"] < 0,
        "candidate_to_drop (model improves when removed)",
        "keep_or_review"
    )

    out_csv = DATA_DIR / f"feature_impact_window_{w.replace(',','_')}.csv"
    imp.to_csv(out_csv, index=False)

    print(f"\n=== Window {w} ===")
    print(f"Baseline cross validated R squared with all features: {base_cv:.4f}")
    print("\nTop 10 to KEEP (largest positive delta when dropped and large permutation drop):")
    display(imp.sort_values(["delta_cv_r_squared_when_dropped","permutation_drop_in_test_r_squared"], ascending=False).head(10))
    print("\nTop 10 to DROP (negative delta when dropped and low/negative permutation drop):")
    display(imp.sort_values(["delta_cv_r_squared_when_dropped","permutation_drop_in_test_r_squared"], ascending=[True, True]).head(10))
    print("Saved:", out_csv)

    all_summaries.append(imp.assign(window=w))

# Combined table (if you ran more than one window)
if all_summaries:
    combined = pd.concat(all_summaries, ignore_index=True)
    combined_path = DATA_DIR / "feature_impact_all_windows.csv"
    combined.to_csv(combined_path, index=False)
    print("Combined table saved:", combined_path)
C:\Users\dcazo\AppData\Local\Temp\ipykernel_4324\795902831.py:57: UserWarning: Parsing dates in %d/%m/%Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning.
  a=pd.to_datetime(s, errors="coerce").dt.normalize()
=== Window 0,5 ===
Baseline cross validated R squared with all features: 0.3431

Top 10 to KEEP (largest positive delta when dropped and large permutation drop):
feature delta_cv_r_squared_when_dropped permutation_drop_in_test_r_squared rank_lofo rank_perm aggregate_rank action
0 gap_proxy_dm1_to_d0 6.613756e-01 1.588933 1.0 3.0 2.0 keep_or_review
7 pre_ret_3d 3.735284e-02 0.038858 2.0 17.0 9.5 keep_or_review
6 mkt_ret_1d_lag1 1.763603e-02 0.040034 3.0 16.0 9.5 keep_or_review
13 pre_ret_10d 1.043617e-02 0.007743 4.0 21.0 12.5 keep_or_review
15 vix_level_lag1 6.095099e-03 -0.011713 5.0 27.0 16.0 keep_or_review
3 pre_ret_5d 1.741542e-03 0.170232 6.0 7.0 6.5 keep_or_review
1 credit_moody_baa_yield_pct 1.669870e-03 13.515042 7.0 1.0 4.0 keep_or_review
2 credit_moody_aaa_yield_pct 1.061318e-03 11.991609 8.0 2.0 5.0 keep_or_review
4 credit_baa_minus_aaa_bp 0.000000e+00 0.132807 9.0 8.0 8.5 keep_or_review
5 credit_investment_grade_option_adjusted_spread... -1.665335e-16 0.132769 10.0 9.0 9.5 candidate_to_drop (model improves when removed)
Top 10 to DROP (negative delta when dropped and low/negative permutation drop):
feature delta_cv_r_squared_when_dropped permutation_drop_in_test_r_squared rank_lofo rank_perm aggregate_rank action
26 macro_fedfunds -0.032414 0.021119 27.0 19.0 23.0 candidate_to_drop (model improves when removed)
23 mkt_ret_5d_lag1 -0.031801 0.037368 26.0 18.0 22.0 candidate_to_drop (model improves when removed)
18 mkt_ret_10d_lag1 -0.028707 0.046340 25.0 14.0 19.5 candidate_to_drop (model improves when removed)
25 pre_vol_3d -0.020214 0.005819 24.0 22.0 23.0 candidate_to_drop (model improves when removed)
21 eps_surprise_pct -0.020014 0.016920 23.0 20.0 21.5 candidate_to_drop (model improves when removed)
17 pre_vol_5d -0.016368 0.050449 22.0 13.0 17.5 candidate_to_drop (model improves when removed)
16 macro_cpi_yoy -0.015780 0.128479 21.0 11.0 16.0 candidate_to_drop (model improves when removed)
24 vix_chg_10d_lag1 -0.008765 0.000686 20.0 24.0 22.0 candidate_to_drop (model improves when removed)
20 vix_chg_5d_lag1 -0.008057 0.001458 19.0 23.0 21.0 candidate_to_drop (model improves when removed)
22 pre_vol_10d -0.007810 -0.002316 18.0 25.0 21.5 candidate_to_drop (model improves when removed)
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\feature_impact_window_0_5.csv
Combined table saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\feature_impact_all_windows.csv
In [5]:
# === Compare v2.1.csv vs v2.2.csv vs v2.3.csv on CAR (0,1) (0,3) (0,5) ===
# Requirements: pandas, numpy, scikit-learn, openpyxl, matplotlib

import re, numpy as np, pandas as pd
from pathlib import Path
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
import matplotlib.pyplot as plt

# ---------- Paths (tries your Windows folder first, then /mnt/data) ----------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
    Path("/mnt/data"),
    Path(".")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["v2.1.csv", "v2.2.csv", "v2.3.csv"]  # put more here if needed
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5

def find_file(name):
    for b in BASE_DIRS:
        p = b / name
        if p.exists(): return p
    raise FileNotFoundError(name)

# ---------- Helpers ----------
def is_readme(name): 
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))

def window_sheets(book):
    out = {"0,1":None,"0,3":None,"0,5":None}
    pats = {"0,1":r"(car.*)?0\D*1(?!\d)","0,3":r"(car.*)?0\D*3(?!\d)","0,5":r"(car.*)?0\D*5(?!\d)"}
    for nm in book:
        if is_readme(nm): continue
        for w,pat in pats.items():
            if out[w] is None and re.search(pat, str(nm), re.I):
                out[w] = nm
    return out

def find_day0(df):
    s=[c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.I)]
    if s: return s[0]
    for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
              "date","Date","trading_date","TradingDate","date0","Date0","DATE0"]:
        if c in df.columns: return c
    # fallback: most date-like
    best,k=None,-1
    for c in df.columns:
        kk = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if kk>k: best,k=c,kk
    return best

def find_ticker(df):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best,score=None,-1
    for c in obj:
        s=df[c].astype(str).str.strip()
        sc=s.nunique() - 0.1*s.str.len().mean()
        if sc>score: best,score=c,sc
    return best

def find_target(df):
    c=[c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
    if c: return c[0]
    c=[c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
    return c[0] if c else None

def norm_day0(s):
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def norm_tic(s): 
    return s.astype(str).str.strip().str.upper()

def group_numeric(df, dcol, tcol):
    g=df.copy()
    g["__day0__"]=norm_day0(g[dcol]); g["__tic__"]=norm_tic(g[tcol])
    nums=g.select_dtypes(include=[np.number]).columns.tolist()
    g=(g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
         .dropna(subset=["__day0__","__tic__"]))
    return g, nums

def build_X(merged, numeric_cols, ycol):
    keep=[c for c in numeric_cols if c in merged.columns]
    X=merged.loc[:, keep].drop(columns=[ycol], errors="ignore")
    nunq=X.nunique(dropna=False)
    return X.loc[:, nunq>1]

def adjusted_r2(X, y, r2_value):
    n, p = len(y), X.shape[1]
    if n-p-1 <= 0: return np.nan
    return 1 - (1 - r2_value) * (n - 1) / (n - p - 1)

def grouped_cv_r2(X, y, groups):
    n_groups = int(pd.Series(groups).nunique())
    if n_groups >= 2:
        gkf = GroupKFold(n_splits=min(MAX_GROUP_FOLDS, n_groups))
        splits = gkf.split(X, y, groups=groups)
    else:
        splits = KFold(n_splits=min(3,len(X)), shuffle=True, random_state=42).split(X,y)
    scores=[]
    for tr, te in splits:
        m = LinearRegression().fit(X.iloc[tr].values, y.iloc[tr].values)
        yh = m.predict(X.iloc[te].values); yt = y.iloc[te].values
        ss_res = np.sum((yt - yh)**2); ss_tot = np.sum((yt - yt.mean())**2)
        scores.append(1 - ss_res/ss_tot if ss_tot>0 else np.nan)
    return float(np.nanmean(scores))

# ---------- Load event study and map windows ----------
evt_book = pd.read_excel(find_file(EVENT_FILE), sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)

# ---------- Score each features file on each window ----------
rows = []
for f in FEATURE_FILES:
    fpath = find_file(f)
    feat_raw = pd.read_csv(fpath)
    dcol, tcol = find_day0(feat_raw), find_ticker(feat_raw)
    feat_g, numeric_cols = group_numeric(feat_raw, dcol, tcol)

    for w in WINDOWS:
        es = win_map[w]
        if es is None:
            print(f"[{f}] window {w} sheet not found. Skipping.")
            continue

        ev = evt_book[es].copy()
        ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
        ev["__day0__"]=norm_day0(ev[ed]); ev["__tic__"]=norm_tic(ev[et])
        ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])

        merged = feat_g.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner")

        X = build_X(merged, numeric_cols, ycol)
        y = merged[ycol].astype(float)
        groups = merged["__tic__"]

        if X.shape[1] == 0 or len(y) < max(10, X.shape[1] + 2):
            rows.append({"model":f, "window":w, "rows_used":len(y), "features_used":X.shape[1],
                         "r_squared":np.nan, "adjusted_r_squared":np.nan,
                         "cross_validated_r_squared":np.nan})
            continue

        lr = LinearRegression().fit(X.values, y.values)
        r2_in = float(lr.score(X.values, y.values))
        adj_in = float(adjusted_r2(X, y, r2_in))
        cv_out = grouped_cv_r2(X, y, groups)

        rows.append({"model":f, "window":w, "rows_used":len(y), "features_used":X.shape[1],
                     "r_squared":r2_in, "adjusted_r_squared":adj_in,
                     "cross_validated_r_squared":cv_out})

# ---------- Results table ----------
results = pd.DataFrame(rows)
results = results.sort_values(["window","cross_validated_r_squared"], ascending=[True, False]).reset_index(drop=True)
display(results)

# Save
out_path = find_file(EVENT_FILE).parent / "v2_models_metrics.csv"
results.to_csv(out_path, index=False)
print("Saved:", out_path)

# ---------- Quick visual: cross validated R squared by window ----------
plt.figure(figsize=(9,5))
for f in FEATURE_FILES:
    sub = results[results.model==f]
    plt.plot(sub["window"], sub["cross_validated_r_squared"], marker="o", label=f)
plt.title("Cross validated R squared — v2 models")
plt.xlabel("Window"); plt.ylabel("Cross validated R squared")
plt.legend(); plt.tight_layout(); plt.show()

# Also show the winner per window
best = results.sort_values(["window","cross_validated_r_squared"], ascending=[True,False])\
              .groupby("window").head(1).reset_index(drop=True)
print("\nBest model per window (by cross validated R squared):")
display(best[["window","model","cross_validated_r_squared","adjusted_r_squared","rows_used","features_used"]])
C:\Users\dcazo\AppData\Local\Temp\ipykernel_4324\1394943956.py:72: UserWarning: Parsing dates in %d/%m/%Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning.
  a = pd.to_datetime(s, errors="coerce").dt.normalize()
C:\Users\dcazo\AppData\Local\Temp\ipykernel_4324\1394943956.py:72: UserWarning: Parsing dates in %d/%m/%Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning.
  a = pd.to_datetime(s, errors="coerce").dt.normalize()
C:\Users\dcazo\AppData\Local\Temp\ipykernel_4324\1394943956.py:72: UserWarning: Parsing dates in %d/%m/%Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning.
  a = pd.to_datetime(s, errors="coerce").dt.normalize()
model window rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared
0 v2.1.csv 0,1 129 7 0.799498 0.787898 0.707091
1 v2.2.csv 0,1 129 6 0.779788 0.768958 0.616111
2 v2.3.csv 0,1 129 13 0.804145 0.782005 0.575099
3 v2.1.csv 0,3 129 7 0.716792 0.700408 0.630746
4 v2.2.csv 0,3 129 6 0.709632 0.695351 0.576169
5 v2.3.csv 0,3 129 13 0.730299 0.699811 0.541791
6 v2.1.csv 0,5 129 7 0.681290 0.662853 0.556400
7 v2.2.csv 0,5 129 6 0.666807 0.650420 0.491290
8 v2.3.csv 0,5 129 13 0.699674 0.665724 0.480121
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\v2_models_metrics.csv
No description has been provided for this image
Best model per window (by cross validated R squared):
window model cross_validated_r_squared adjusted_r_squared rows_used features_used
0 0,1 v2.1.csv 0.707091 0.787898 129 7
1 0,3 v2.1.csv 0.630746 0.700408 129 7
2 0,5 v2.1.csv 0.556400 0.662853 129 7
In [7]:
# === Compare v2.1.1.csv vs v2.1.csv on CAR_(0,1)/(0,3)/(0,5) ===
# Needs: pandas, numpy, scikit-learn, openpyxl, matplotlib

import re, numpy as np, pandas as pd
from pathlib import Path
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
import matplotlib.pyplot as plt

# ----- Paths (edit if needed) -----
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
    Path("/mnt/data"),
    Path(".")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["v2.1.1.csv", "v2.1.csv"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5

def find_file(name):
    for b in BASE_DIRS:
        p = b / name
        if p.exists(): return p
    raise FileNotFoundError(name)

# ----- Helpers -----
def is_readme(name): 
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))

def window_sheets(book):
    out = {"0,1":None,"0,3":None,"0,5":None}
    pats = {"0,1":r"(car.*)?0\D*1(?!\d)","0,3":r"(car.*)?0\D*3(?!\d)","0,5":r"(car.*)?0\D*5(?!\d)"}
    for nm in book:
        if is_readme(nm): continue
        for w,pat in pats.items():
            if out[w] is None and re.search(pat, str(nm), re.I): out[w]=nm
    return out

def find_day0(df):
    s=[c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.I)]
    if s: return s[0]
    for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
              "date","Date","trading_date","TradingDate","date0","Date0","DATE0"]:
        if c in df.columns: return c
    best,k=None,-1
    for c in df.columns:
        kk = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if kk>k: best,k=c,kk
    return best

def find_ticker(df):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best,score=None,-1
    for c in obj:
        s=df[c].astype(str).str.strip()
        sc=s.nunique() - 0.1*s.str.len().mean()
        if sc>score: best,score=c,sc
    return best

def find_target(df):
    c=[c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
    if c: return c[0]
    c=[c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
    return c[0] if c else None

def norm_day0(s):
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def norm_tic(s): 
    return s.astype(str).str.strip().str.upper()

def group_numeric(df, dcol, tcol):
    g=df.copy()
    g["__day0__"]=norm_day0(g[dcol]); g["__tic__"]=norm_tic(g[tcol])
    nums=g.select_dtypes(include=[np.number]).columns.tolist()
    g=(g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
         .dropna(subset=["__day0__","__tic__"]))
    return g, nums

def build_X(merged, numeric_cols, ycol):
    keep=[c for c in numeric_cols if c in merged.columns]
    X=merged.loc[:, keep].drop(columns=[ycol], errors="ignore")
    nunq=X.nunique(dropna=False)
    return X.loc[:, nunq>1]

def adjusted_r2(X, y, r2_value):
    n, p = len(y), X.shape[1]
    if n-p-1 <= 0: return np.nan
    return 1 - (1 - r2_value) * (n - 1) / (n - p - 1)

def grouped_cv_r2(X, y, groups, max_folds=5):
    n_groups = int(pd.Series(groups).nunique())
    if len(X) < 3: return np.nan
    if n_groups >= 2:
        splits = GroupKFold(n_splits=min(max_folds, n_groups)).split(X, y, groups=groups)
    else:
        splits = KFold(n_splits=min(3,len(X)), shuffle=True, random_state=42).split(X, y)
    scores=[]
    for tr, te in splits:
        m = LinearRegression().fit(X.iloc[tr].values, y.iloc[tr].values)
        yh = m.predict(X.iloc[te].values); yt = y.iloc[te].values
        ss_res = np.sum((yt - yh)**2); ss_tot = np.sum((yt - yt.mean())**2)
        scores.append(1 - ss_res/ss_tot if ss_tot>0 else np.nan)
    return float(np.nanmean(scores))

# ----- Load event study -----
evt_book = pd.read_excel(find_file(EVENT_FILE), sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)

# ----- Score both files -----
rows = []
for f in FEATURE_FILES:
    fpath = find_file(f)
    feat_raw = pd.read_csv(fpath)
    dcol, tcol = find_day0(feat_raw), find_ticker(feat_raw)
    feat_g, numeric_cols = group_numeric(feat_raw, dcol, tcol)

    for w in WINDOWS:
        es = win_map[w]
        if es is None:
            print(f"[{f}] window {w} sheet not found. Skipping.")
            continue

        ev = evt_book[es].copy()
        ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
        ev["__day0__"]=norm_day0(ev[ed]); ev["__tic__"]=norm_tic(ev[et])
        ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])

        merged = feat_g.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner")

        X = build_X(merged, numeric_cols, ycol)
        y = merged[ycol].astype(float)
        groups = merged["__tic__"]

        if X.shape[1] == 0 or len(y) < max(10, X.shape[1] + 2):
            rows.append({"model":f, "window":w, "rows_used":len(y), "features_used":X.shape[1],
                         "r_squared":np.nan, "adjusted_r_squared":np.nan,
                         "cross_validated_r_squared":np.nan})
            continue

        lr = LinearRegression().fit(X.values, y.values)
        r2_in = float(lr.score(X.values, y.values))
        adj_in = float(adjusted_r2(X, y, r2_in))
        cv_out = grouped_cv_r2(X, y, groups, MAX_GROUP_FOLDS)

        rows.append({"model":f, "window":w, "rows_used":len(y), "features_used":X.shape[1],
                     "r_squared":r2_in, "adjusted_r_squared":adj_in,
                     "cross_validated_r_squared":cv_out})

# ----- Results table -----
results = pd.DataFrame(rows).sort_values(["window","cross_validated_r_squared"], ascending=[True, False]).reset_index(drop=True)
display(results)

# Save
out_path = find_file(EVENT_FILE).parent / "v2_1_1_vs_v2_1_metrics.csv"
results.to_csv(out_path, index=False)
print("Saved:", out_path)

# ----- Quick plot: cross validated R^2 by window -----
plt.figure(figsize=(8,5))
for f in FEATURE_FILES:
    sub = results[results.model==f]
    plt.plot(sub["window"], sub["cross_validated_r_squared"], marker="o", label=f)
plt.title("Cross validated R squared — v2.1.1 vs v2.1")
plt.xlabel("Window"); plt.ylabel("Cross validated R squared")
plt.legend(); plt.tight_layout(); plt.show()

# Winner per window
best = results.sort_values(["window","cross_validated_r_squared"], ascending=[True,False]).groupby("window").head(1)
print("\nBest per window:")
display(best[["window","model","cross_validated_r_squared","adjusted_r_squared","rows_used","features_used"]])
C:\Users\dcazo\AppData\Local\Temp\ipykernel_4324\832927390.py:70: UserWarning: Parsing dates in %d/%m/%Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning.
  a = pd.to_datetime(s, errors="coerce").dt.normalize()
C:\Users\dcazo\AppData\Local\Temp\ipykernel_4324\832927390.py:70: UserWarning: Parsing dates in %d/%m/%Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning.
  a = pd.to_datetime(s, errors="coerce").dt.normalize()
model window rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared
0 v2.1.csv 0,1 129 7 0.799498 0.787898 0.707091
1 v2.1.1.csv 0,1 129 6 0.185055 0.144976 0.042623
2 v2.1.csv 0,3 129 7 0.716792 0.700408 0.630746
3 v2.1.1.csv 0,3 129 6 0.142172 0.099984 0.046614
4 v2.1.csv 0,5 129 7 0.681290 0.662853 0.556400
5 v2.1.1.csv 0,5 129 6 0.149269 0.107430 0.073914
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\v2_1_1_vs_v2_1_metrics.csv
No description has been provided for this image
Best per window:
window model cross_validated_r_squared adjusted_r_squared rows_used features_used
0 0,1 v2.1.csv 0.707091 0.787898 129 7
2 0,3 v2.1.csv 0.630746 0.700408 129 7
4 0,5 v2.1.csv 0.556400 0.662853 129 7
In [1]:
# === Compare v1.2.xlsx vs v1.3.xlsx on CAR (0,1) (0,3) (0,5) ===
# Needs: pandas, numpy, scikit-learn, openpyxl, matplotlib

import re, numpy as np, pandas as pd
from pathlib import Path
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, KFold
import matplotlib.pyplot as plt

# ---------- Paths (tries your Windows folder first, then /mnt/data) ----------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
    Path("/mnt/data"),
    Path(".")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["v1.2.xlsx", "v1.3.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]
MAX_GROUP_FOLDS = 5

def find_file(name):
    for b in BASE_DIRS:
        p = b / name
        if p.exists(): return p
    raise FileNotFoundError(name)

# ---------- Helpers ----------
def is_readme(name): 
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))

def window_sheets(book):
    out = {"0,1":None,"0,3":None,"0,5":None}
    pats = {"0,1":r"(car.*)?0\D*1(?!\d)","0,3":r"(car.*)?0\D*3(?!\d)","0,5":r"(car.*)?0\D*5(?!\d)"}
    for nm in book:
        if is_readme(nm): continue
        for w,pat in pats.items():
            if out[w] is None and re.search(pat, str(nm), re.I): out[w]=nm
    return out

def choose_features_sheet(book):
    cands = [(n, df) for n, df in book.items() if not is_readme(n)]
    if not cands: return next(iter(book))
    def score(x):
        _, df = x
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_day0(df):
    s=[c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.I)]
    if s: return s[0]
    for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
              "date","Date","trading_date","TradingDate","date0","Date0","DATE0"]:
        if c in df.columns: return c
    best,k=None,-1
    for c in df.columns:
        kk = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if kk>k: best,k=c,kk
    return best

def find_ticker(df):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best,score=None,-1
    for c in obj:
        s=df[c].astype(str).str.strip()
        sc=s.nunique() - 0.1*s.str.len().mean()
        if sc>score: best,score=c,sc
    return best

def find_target(df):
    c=[c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
    if c: return c[0]
    c=[c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
    return c[0] if c else None

def norm_day0(s):
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def norm_tic(s): 
    return s.astype(str).str.strip().str.upper()

def group_numeric(df, dcol, tcol):
    g=df.copy()
    g["__day0__"]=norm_day0(g[dcol]); g["__tic__"]=norm_tic(g[tcol])
    nums=g.select_dtypes(include=[np.number]).columns.tolist()
    g=(g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
         .dropna(subset=["__day0__","__tic__"]))
    return g, nums

def build_X(merged, numeric_cols, ycol):
    keep=[c for c in numeric_cols if c in merged.columns]
    X=merged.loc[:, keep].drop(columns=[ycol], errors="ignore")
    nunq=X.nunique(dropna=False)
    return X.loc[:, nunq>1]

def adjusted_r2(X, y, r2_value):
    n, p = len(y), X.shape[1]
    if n-p-1 <= 0: return np.nan
    return 1 - (1 - r2_value) * (n - 1) / (n - p - 1)

def grouped_cv_r2(X, y, groups):
    n_groups = int(pd.Series(groups).nunique())
    if len(X) < 3: return np.nan
    if n_groups >= 2:
        splits = GroupKFold(n_splits=min(MAX_GROUP_FOLDS, n_groups)).split(X, y, groups=groups)
    else:
        splits = KFold(n_splits=min(3,len(X)), shuffle=True, random_state=42).split(X, y)
    scores=[]
    for tr, te in splits:
        m = LinearRegression().fit(X.iloc[tr].values, y.iloc[tr].values)
        yh = m.predict(X.iloc[te].values); yt = y.iloc[te].values
        ss_res = np.sum((yt - yh)**2); ss_tot = np.sum((yt - yt.mean())**2)
        scores.append(1 - ss_res/ss_tot if ss_tot>0 else np.nan)
    return float(np.nanmean(scores))

# ---------- Load event study and map windows ----------
evt_book = pd.read_excel(find_file(EVENT_FILE), sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)

# ---------- Score each file on each window ----------
rows = []
for f in FEATURE_FILES:
    fpath = find_file(f)
    feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
    fsheet = choose_features_sheet(feat_book)
    raw = feat_book[fsheet].copy()

    dcol, tcol = find_day0(raw), find_ticker(raw)
    feat_g, numeric_cols = group_numeric(raw, dcol, tcol)

    for w in WINDOWS:
        es = win_map[w]
        if es is None:
            print(f"[{f}] window {w} sheet not found. Skipping.")
            continue

        ev = evt_book[es].copy()
        ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
        ev["__day0__"]=norm_day0(ev[ed]); ev["__tic__"]=norm_tic(ev[et])
        ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])

        merged = feat_g.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner")

        X = build_X(merged, numeric_cols, ycol)
        y = merged[ycol].astype(float)
        groups = merged["__tic__"]

        if X.shape[1] == 0 or len(y) < max(10, X.shape[1] + 2):
            rows.append({"model":f, "window":w, "rows_used":len(y), "features_used":X.shape[1],
                         "r_squared":np.nan, "adjusted_r_squared":np.nan,
                         "cross_validated_r_squared":np.nan})
            continue

        lr = LinearRegression().fit(X.values, y.values)
        r2_in = float(lr.score(X.values, y.values))
        adj_in = float(adjusted_r2(X, y, r2_in))
        cv_out = grouped_cv_r2(X, y, groups)

        rows.append({"model":f, "window":w, "rows_used":len(y), "features_used":X.shape[1],
                     "r_squared":r2_in, "adjusted_r_squared":adj_in,
                     "cross_validated_r_squared":cv_out})

# ---------- Results table ----------
results = pd.DataFrame(rows)
results = results.sort_values(["window","cross_validated_r_squared"], ascending=[True, False]).reset_index(drop=True)
display(results)

# Save
out_path = find_file(EVENT_FILE).parent / "v1_2_vs_v1_3_metrics.csv"
results.to_csv(out_path, index=False)
print("Saved:", out_path)

# ---------- Quick plot: cross validated coefficient of determination by window ----------
plt.figure(figsize=(8,5))
for f in FEATURE_FILES:
    sub = results[results.model==f]
    plt.plot(sub["window"], sub["cross_validated_r_squared"], marker="o", label=f)
plt.title("Cross validated coefficient of determination — v1.2 vs v1.3")
plt.xlabel("Window"); plt.ylabel("Cross validated coefficient of determination")
plt.legend(); plt.tight_layout(); plt.show()

# Winner per window
best = (results.sort_values(["window","cross_validated_r_squared"], ascending=[True,False])
               .groupby("window").head(1).reset_index(drop=True))
print("\nBest per window:")
display(best[["window","model","cross_validated_r_squared","adjusted_r_squared","rows_used","features_used"]])
model window rows_used features_used r_squared adjusted_r_squared cross_validated_r_squared
0 v1.2.xlsx 0,1 129 7 0.245005 0.201328 0.072892
1 v1.3.xlsx 0,1 129 2 0.148360 0.134842 -0.018751
2 v1.2.xlsx 0,3 129 7 0.201430 0.155231 0.099199
3 v1.3.xlsx 0,3 129 2 0.103903 0.089679 -0.002105
4 v1.2.xlsx 0,5 129 7 0.214615 0.169179 0.133885
5 v1.3.xlsx 0,5 129 2 0.116031 0.102000 0.043618
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\v1_2_vs_v1_3_metrics.csv
No description has been provided for this image
Best per window:
window model cross_validated_r_squared adjusted_r_squared rows_used features_used
0 0,1 v1.2.xlsx 0.072892 0.201328 129 7
1 0,3 v1.2.xlsx 0.099199 0.155231 129 7
2 0,5 v1.2.xlsx 0.133885 0.169179 129 7
In [7]:
# === FIXED: Time-aware cross validation for v1.2.xlsx vs v1.3.xlsx (0,1 / 0,3 / 0,5) ===
# Safe when some windows produce no valid time splits.

from pathlib import Path
import re, numpy as np, pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# ---------------- CONFIG ----------------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
    Path("/mnt/data"),
    Path(".")
]
EVENT_FILE = "event_study.xlsx"
FEATURE_FILES = ["v1.2.xlsx", "v1.3.xlsx"]
WINDOWS = ["0,1","0,3","0,5"]

MIN_TRAIN_QUARTERS = 4
BLOCK_SAME_TICKERS = True
WINSOR_PCTS = (1, 99)
SAVE_PLOTS = True

# -------------- HELPERS --------------
def find_file(name):
    for b in BASE_DIRS:
        p = b / name
        if p.exists(): return p
    raise FileNotFoundError(name)

def is_readme(name): 
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))

def choose_features_sheet(book):
    cands = [(n, df) for n, df in book.items() if not is_readme(n)]
    if not cands: return next(iter(book))
    def score(item):
        _, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def window_sheets(book):
    out = {"0,1":None,"0,3":None,"0,5":None}
    pats = {"0,1":r"(car.*)?0\D*1(?!\d)", "0,3":r"(car.*)?0\D*3(?!\d)", "0,5":r"(car.*)?0\D*5(?!\d)"}
    for nm in book:
        if is_readme(nm): continue
        for w, pat in pats.items():
            if out[w] is None and re.search(pat, str(nm), re.IGNORECASE):
                out[w] = nm
    return out

def find_day0(df):
    s=[c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.I)]
    if s: return s[0]
    for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    best,k=None,-1
    for c in df.columns:
        kk = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if kk>k: best,k=c,kk
    return best

def find_ticker(df):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best,score=None,-1
    for c in obj:
        s=df[c].astype(str).str.strip()
        sc=s.nunique() - 0.1*s.str.len().mean()
        if sc>score: best,score=c,sc
    return best

def find_target(df):
    c=[c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
    if c: return c[0]
    c=[c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
    return c[0] if c else None

def norm_day0(s):
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def norm_tic(s): 
    return s.astype(str).str.strip().str.upper()

def group_numeric(df, dcol, tcol):
    g=df.copy()
    g["__day0__"]=norm_day0(g[dcol]); g["__tic__"]=norm_tic(g[tcol])
    nums=g.select_dtypes(include=[np.number]).columns.tolist()
    g=(g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
         .dropna(subset=["__day0__","__tic__"]))
    return g, nums

def build_X(merged, numeric_cols, ycol):
    keep=[c for c in numeric_cols if c in merged.columns]
    X=merged.loc[:, keep].drop(columns=[ycol], errors="ignore")
    nunq=X.nunique(dropna=False)
    return X.loc[:, nunq>1]

def to_quarter(s):
    d = pd.to_datetime(s, errors="coerce")
    return d.dt.to_period("Q").astype(str)

def fit_transformers(Xtr, lo=1, hi=99):
    stats={}
    Xw=Xtr.copy()
    for c in Xw.columns:
        lo_v, hi_v = np.nanpercentile(Xw[c].values, [lo, hi])
        clamped = Xw[c].clip(lo_v, hi_v)
        mu = float(np.nanmean(clamped))
        sd = float(np.nanstd(clamped, ddof=0)) or 1.0
        stats[c] = {"lo": float(lo_v), "hi": float(hi_v), "mu": mu, "sd": sd}
        Xw[c] = (clamped - mu) / sd
    return stats, Xw

def apply_transformers(Xte, stats):
    Xw=Xte.copy()
    for c in Xw.columns:
        if c not in stats: continue
        lo_v, hi_v, mu, sd = stats[c]["lo"], stats[c]["hi"], stats[c]["mu"], stats[c]["sd"]
        Xw[c] = (Xw[c].clip(lo_v, hi_v) - mu) / sd
    return Xw

def test_r2_on_split(Xtr, ytr, Xte, yte):
    stats, Xtr_s = fit_transformers(Xtr, lo=WINSOR_PCTS[0], hi=WINSOR_PCTS[1])
    Xte_s = apply_transformers(Xte, stats)
    m = LinearRegression().fit(Xtr_s.values, ytr.values)
    yh = m.predict(Xte_s.values)
    ss_res = np.sum((yte.values - yh)**2)
    ss_tot = np.sum((yte.values - yte.values.mean())**2)
    return (1 - ss_res/ss_tot) if ss_tot>0 else np.nan

def time_splits(df, min_train_quarters=4):
    q = to_quarter(df["__day0__"])
    uniq = pd.Index(q.unique()).sort_values()
    splits=[]
    for k in range(min_train_quarters, len(uniq)):
        train_q = set(uniq[:k])
        test_q  = {uniq[k]}
        tr_idx = q.isin(train_q).values
        te_idx = q.isin(test_q).values
        splits.append((np.where(tr_idx)[0], np.where(te_idx)[0], uniq[k]))
    return splits

# ---------------- LOAD ----------------
evt_book = pd.read_excel(find_file(EVENT_FILE), sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)
out_dir = find_file(EVENT_FILE).parent

all_rows = []
per_split_rows = []

for f in FEATURE_FILES:
    fpath = find_file(f)
    feat_book = pd.read_excel(fpath, sheet_name=None, engine="openpyxl")
    fsheet = choose_features_sheet(feat_book)
    raw = feat_book[fsheet].copy()

    dcol, tcol = find_day0(raw), find_ticker(raw)
    feat_g, numeric_cols = group_numeric(raw, dcol, tcol)

    for w in WINDOWS:
        es = win_map[w]
        if es is None:
            print(f"[{f}] window {w} sheet not found. Skipping.")
            continue

        ev = evt_book[es].copy()
        ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
        ev["__day0__"]=norm_day0(ev[ed]); ev["__tic__"]=norm_tic(ev[et])
        ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])

        merged = feat_g.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner").copy()
        if merged.empty:
            all_rows.append({"model":f, "window":w, "splits":0, "mean_oos_coefficient_of_determination":np.nan,
                             "median_oos_coefficient_of_determination":np.nan, "rows_used":0, "features_used":0})
            continue

        X = build_X(merged, numeric_cols, ycol)
        y = merged[ycol].astype(float)

        merged = merged.assign(__q__ = to_quarter(merged["__day0__"]))
        splits = time_splits(merged, min_train_quarters=MIN_TRAIN_QUARTERS)

        split_scores=[]
        for tr_idx, te_idx, test_q in splits:
            Xtr, ytr = X.iloc[tr_idx], y.iloc[tr_idx]
            Xte, yte = X.iloc[te_idx], y.iloc[te_idx]

            if BLOCK_SAME_TICKERS:
                te_tics = set(merged.iloc[te_idx]["__tic__"])
                keep_tr = ~merged.iloc[tr_idx]["__tic__"].isin(te_tics).values
                Xtr, ytr = Xtr.iloc[keep_tr], ytr.iloc[keep_tr]
                if len(ytr) < X.shape[1] + 2:
                    continue

            if len(ytr)==0 or len(yte)==0:
                continue

            r2 = test_r2_on_split(Xtr, ytr, Xte, yte)
            split_scores.append((str(test_q), float(r2), len(yte)))

            per_split_rows.append({
                "model": f, "window": w, "test_quarter": str(test_q),
                "test_rows": len(yte), "oos_coefficient_of_determination": float(r2)
            })

        if split_scores:
            scores = [s[1] for s in split_scores if np.isfinite(s[1])]
            mean_oos = float(np.nanmean(scores)) if scores else np.nan
            median_oos = float(np.nanmedian(scores)) if scores else np.nan
            all_rows.append({
                "model": f, "window": w, "splits": len(split_scores),
                "mean_oos_coefficient_of_determination": mean_oos,
                "median_oos_coefficient_of_determination": median_oos,
                "rows_used": len(X), "features_used": X.shape[1]
            })
        else:
            all_rows.append({
                "model": f, "window": w, "splits": 0,
                "mean_oos_coefficient_of_determination": np.nan,
                "median_oos_coefficient_of_determination": np.nan,
                "rows_used": len(X), "features_used": X.shape[1]
            })

# ---------------- SAVE RESULTS (robust to empty) ----------------
summary = pd.DataFrame(all_rows).sort_values(
    ["window","mean_oos_coefficient_of_determination"],
    ascending=[True, False]
).reset_index(drop=True)

per_split_cols = ["model","window","test_quarter","test_rows","oos_coefficient_of_determination"]
per_split = pd.DataFrame(per_split_rows, columns=per_split_cols)
if not per_split.empty:
    per_split = per_split.sort_values(["window","test_quarter","model"]).reset_index(drop=True)

sum_path = out_dir / "time_cv_v12_vs_v13_summary.csv"
split_path = out_dir / "time_cv_v12_vs_v13_per_quarter.csv"
summary.to_csv(sum_path, index=False)
per_split.to_csv(split_path, index=False)
print("Saved:", sum_path)
print("Saved:", split_path)

# ---------------- PLOTS (only if we have rows) ----------------
if SAVE_PLOTS and not per_split.empty:
    for w in WINDOWS:
        sub = per_split[per_split.window==w]
        if sub.empty: 
            continue
        fig = plt.figure(figsize=(9,5))
        for mdl, g in sub.groupby("model"):
            qorder = pd.Index(g["test_quarter"].unique()).sort_values()
            order_map = {q:i for i,q in enumerate(qorder)}
            gg = g.copy()
            gg["__ord__"] = gg["test_quarter"].map(order_map)
            gg = gg.sort_values("__ord__")
            plt.plot(gg["__ord__"], gg["oos_coefficient_of_determination"], marker="o", label=mdl)
        plt.title(f"Time-aware out-of-sample coefficient of determination by quarter — window {w}")
        plt.xlabel("Quarter (ordered)"); plt.ylabel("Out-of-sample coefficient of determination")
        plt.legend(); plt.tight_layout()
        png_path = out_dir / f"time_cv_oos_r2_window_{w.replace(',','_')}.png"
        plt.savefig(png_path, dpi=150)
        plt.show()
        print("Saved:", png_path)

# ---------------- PRINT SUMMARY ----------------
print("\nTime-aware cross validation — mean out-of-sample coefficient of determination")
display(summary)
if per_split.empty:
    print("Note: No valid per-quarter splits were created (likely not enough history or all splits were skipped after ticker blocking).")
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\time_cv_v12_vs_v13_summary.csv
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\time_cv_v12_vs_v13_per_quarter.csv

Time-aware cross validation — mean out-of-sample coefficient of determination
model window splits mean_oos_coefficient_of_determination median_oos_coefficient_of_determination rows_used features_used
0 v1.2.xlsx 0,1 0 NaN NaN 129 7
1 v1.3.xlsx 0,1 0 NaN NaN 129 2
2 v1.2.xlsx 0,3 0 NaN NaN 129 7
3 v1.3.xlsx 0,3 0 NaN NaN 129 2
4 v1.2.xlsx 0,5 0 NaN NaN 129 7
5 v1.3.xlsx 0,5 0 NaN NaN 129 2
Note: No valid per-quarter splits were created (likely not enough history or all splits were skipped after ticker blocking).
In [10]:
# === AAPL feature importance for v1.2 (window 0,5) with time-aware 4-quarter test blocks ===
# Fixes the NaN issue by using multi-quarter test blocks and pooled OOS R^2.
# Joins on day0 + ticker, winsorises/standardises on train only.

from pathlib import Path
import re, numpy as np, pandas as pd
from sklearn.linear_model import LinearRegression

# ---------------- CONFIG ----------------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
    Path("/mnt/data"),
    Path(".")
]
EVENT_FILE    = "event_study.xlsx"
FEATURES_FILE = "v1.2.xlsx"
WINDOW        = "0,5"
TICKER        = "AAPL"

MIN_TRAIN_QUARTERS   = 4     # expanding train must include at least this many quarters
TEST_BLOCK_QUARTERS  = 4     # test on N consecutive quarters to get >=2 points per test
STEP_QUARTERS        = 1     # slide test window by this many quarters
MIN_TEST_SIZE        = 2     # require at least this many rows in a test block
WINSOR_PCTS          = (1, 99)
np.random.seed(42)

# ---------------- HELPERS ----------------
def find_file(name):
    for b in BASE_DIRS:
        p = b / name
        if p.exists(): return p
    raise FileNotFoundError(name)

def is_readme(name): 
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))

def choose_features_sheet(book):
    cands = [(n, df) for n, df in book.items() if not is_readme(n)]
    if not cands: return next(iter(book))
    def score(item):
        _, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def window_sheets(book):
    out = {"0,1":None,"0,3":None,"0,5":None}
    pats = {"0,1":r"(car.*)?0\D*1(?!\d)", "0,3":r"(car.*)?0\D*3(?!\d)", "0,5":r"(car.*)?0\D*5(?!\d)"}
    for nm in book:
        if is_readme(nm): continue
        for w, pat in pats.items():
            if out[w] is None and re.search(pat, str(nm), re.IGNORECASE):
                out[w] = nm
    return out

def find_day0(df):
    s=[c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.I)]
    if s: return s[0]
    for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
              "date","Date","trading_date","TradingDate","day0date","date0","Date0","DATE0"]:
        if c in df.columns: return c
    best,k=None,-1
    for c in df.columns:
        kk = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if kk>k: best,k=c,kk
    return best

def find_ticker(df):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best,score=None,-1
    for c in obj:
        s=df[c].astype(str).str.strip()
        sc=s.nunique() - 0.1*s.str.len().mean()
        if sc>score: best,score=c,sc
    return best

def find_target(df):
    c=[c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
    if c: return c[0]
    c=[c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
    return c[0] if c else None

def norm_day0(s):
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def norm_tic(s): 
    return s.astype(str).str.strip().str.upper()

def group_numeric(df, dcol, tcol):
    g=df.copy()
    g["__day0__"]=norm_day0(g[dcol]); g["__tic__"]=norm_tic(g[tcol])
    nums=g.select_dtypes(include=[np.number]).columns.tolist()
    g=(g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
         .dropna(subset=["__day0__","__tic__"]))
    return g, nums

def build_X(merged, numeric_cols, ycol):
    keep=[c for c in numeric_cols if c in merged.columns]
    X=merged.loc[:, keep].drop(columns=[ycol], errors="ignore")
    nunq=X.nunique(dropna=False)
    return X.loc[:, nunq>1]

def to_quarter(s):
    d = pd.to_datetime(s, errors="coerce")
    return d.dt.to_period("Q").astype(str)

def fit_transformers(Xtr, lo=1, hi=99):
    stats={}
    Xw=Xtr.copy()
    for c in Xw.columns:
        lo_v, hi_v = np.nanpercentile(Xw[c].values, [lo, hi])
        clamped = Xw[c].clip(lo_v, hi_v)
        mu = float(np.nanmean(clamped))
        sd = float(np.nanstd(clamped, ddof=0)) or 1.0
        stats[c] = {"lo": float(lo_v), "hi": float(hi_v), "mu": mu, "sd": sd}
        Xw[c] = (clamped - mu) / sd
    return stats, Xw

def apply_transformers(Xte, stats):
    Xw=Xte.copy()
    for c in Xw.columns:
        if c not in stats: continue
        lo_v, hi_v, mu, sd = stats[c]["lo"], stats[c]["hi"], stats[c]["mu"], stats[c]["sd"]
        Xw[c] = (Xw[c].clip(lo_v, hi_v) - mu) / sd
    return Xw

def predict_fold(Xtr, ytr, Xte, yte):
    stats, Xtr_s = fit_transformers(Xtr, lo=WINSOR_PCTS[0], hi=WINSOR_PCTS[1])
    Xte_s = apply_transformers(Xte, stats)
    m = LinearRegression().fit(Xtr_s.values, ytr.values)
    yh = m.predict(Xte_s.values)
    return yh, m.coef_

def pooled_oos_r2(y_true_all, y_pred_all):
    yt = np.asarray(y_true_all)
    yp = np.asarray(y_pred_all)
    ss_res = np.sum((yt - yp)**2)
    ss_tot = np.sum((yt - yt.mean())**2)
    return float(1 - ss_res/ss_tot) if ss_tot > 0 else np.nan

def build_time_blocks(df, min_train_q=4, test_block_q=4, step_q=1):
    q = to_quarter(df["__day0__"])
    uniq = pd.Index(q.unique()).sort_values()
    blocks=[]
    for start in range(min_train_q, len(uniq)-0):
        # train quarters
        train_q = set(uniq[:start])
        # test block quarters
        if start + test_block_q > len(uniq):
            break
        test_q = set(uniq[start:start+test_block_q])
        tr_idx = q.isin(train_q).values
        te_idx = q.isin(test_q).values
        if tr_idx.sum() >= 1 and te_idx.sum() >= MIN_TEST_SIZE:
            blocks.append((np.where(tr_idx)[0], np.where(te_idx)[0],
                           f"{uniq[start]}..{uniq[start+test_block_q-1]}"))
        # step
        if step_q > 1:
            start += (step_q-1)
    return blocks

# ---------------- LOAD AAPL DATA ----------------
evt_book = pd.read_excel(find_file(EVENT_FILE), sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)
sheet = win_map.get(WINDOW)
assert sheet is not None, f"Could not find event sheet for window {WINDOW}"

ev = evt_book[sheet].copy()
ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
ev["__day0__"]=norm_day0(ev[ed]); ev["__tic__"]=norm_tic(ev[et])
ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])
ev = ev[ev["__tic__"] == TICKER]

feat_book = pd.read_excel(find_file(FEATURES_FILE), sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
raw = feat_book[fsheet].copy()
dcol, tcol = find_day0(raw), find_ticker(raw)
feat_g, numeric_cols = group_numeric(raw, dcol, tcol)
feat_g = feat_g[feat_g["__tic__"] == TICKER]

merged = feat_g.merge(ev[["__day0__","__tic__", ycol]], on=["__day0__","__tic__"], how="inner").copy()
assert not merged.empty, "No AAPL rows after merge. Check keys/values."

# Build X,y
X_full = build_X(merged, numeric_cols, ycol)
nunq = X_full.nunique(dropna=False)
X_full = X_full.loc[:, nunq > 1]
y_full = merged[ycol].astype(float)

# Diagnostics
q = to_quarter(merged["__day0__"])
print(f"AAPL rows: {len(X_full)} | features: {X_full.shape[1]} | unique quarters: {q.nunique()}")

# Build time-aware blocks
merged = merged.assign(__q__ = q)
blocks = build_time_blocks(merged, min_train_q=MIN_TRAIN_QUARTERS,
                           test_block_q=TEST_BLOCK_QUARTERS, step_q=STEP_QUARTERS)
assert len(blocks) > 0, "No valid time blocks. Reduce TEST_BLOCK_QUARTERS or MIN_TRAIN_QUARTERS."

# -------- Baseline pooled OOS predictions (all features) --------
y_true_all, y_pred_all = [], []
coef_abs = {f: [] for f in X_full.columns}
for tr, te, label in blocks:
    Xtr, ytr = X_full.iloc[tr], y_full.iloc[tr]
    Xte, yte = X_full.iloc[te], y_full.iloc[te]
    yh, coef = predict_fold(Xtr, ytr, Xte, yte)
    y_true_all.extend(yte.tolist())
    y_pred_all.extend(yh.tolist())
    for f,c in zip(X_full.columns, coef):
        coef_abs[f].append(abs(float(c)))

base_oos_r2 = pooled_oos_r2(y_true_all, y_pred_all)
print(f"Baseline pooled OOS R^2 (all features): {base_oos_r2:.4f}")

# -------- Leave-one-feature-out (LOFO) pooled OOS R^2 deltas --------
lofo_delta = {}
for fdrop in X_full.columns:
    y_true_all, y_pred_all = [], []
    Xm = X_full.drop(columns=[fdrop])
    for tr, te, label in blocks:
        Xtr, ytr = Xm.iloc[tr], y_full.iloc[tr]
        Xte, yte = Xm.iloc[te], y_full.iloc[te]
        yh, _ = predict_fold(Xtr, ytr, Xte, yte)
        y_true_all.extend(yte.tolist())
        y_pred_all.extend(yh.tolist())
    oos = pooled_oos_r2(y_true_all, y_pred_all)
    lofo_delta[fdrop] = base_oos_r2 - oos  # positive = helpful; negative = harmful

# -------- Permutation importance over test blocks --------
perm_drop = {f: [] for f in X_full.columns}
for tr, te, label in blocks:
    # train baseline on this block
    Xtr, ytr = X_full.iloc[tr], y_full.iloc[tr]
    Xte, yte = X_full.iloc[te], y_full.iloc[te]
    yh_base, _ = predict_fold(Xtr, ytr, Xte, yte)
    r2_base = pooled_oos_r2(yte.values, np.asarray(yh_base))
    for f in X_full.columns:
        Xperm = Xte.copy()
        Xperm[f] = np.random.permutation(Xperm[f].values)  # permute within block
        yh_perm, _ = predict_fold(Xtr, ytr, Xperm, yte)
        r2_perm = pooled_oos_r2(yte.values, np.asarray(yh_perm))
        drop = (r2_base - r2_perm) if (np.isfinite(r2_base) and np.isfinite(r2_perm)) else np.nan
        perm_drop[f].append(drop)

# -------- Assemble importance table --------
imp = pd.DataFrame({
    "feature": list(X_full.columns),
    "lofo_delta_oos_r2": [lofo_delta[f] for f in X_full.columns],
    "perm_drop_in_test_r2": [float(np.nanmean(perm_drop[f])) for f in X_full.columns],
    "mean_abs_std_coef": [float(np.nanmean(coef_abs[f])) for f in X_full.columns],
})

# Ranks (1 = most important)
imp["rank_lofo"] = imp["lofo_delta_oos_r2"].rank(ascending=False, method="min")
imp["rank_perm"] = imp["perm_drop_in_test_r2"].rank(ascending=False, method="min")
imp["rank_coef"] = imp["mean_abs_std_coef"].rank(ascending=False, method="min")
imp["aggregate_rank"] = imp[["rank_lofo","rank_perm","rank_coef"]].mean(axis=1)

imp = imp.sort_values("aggregate_rank").reset_index(drop=True)

# Save
out_dir = find_file(EVENT_FILE).parent
out_csv = out_dir / f"aapl_feature_importance_v12_window_{WINDOW.replace(',','_')}_testblock{TEST_BLOCK_QUARTERS}.csv"
imp.to_csv(out_csv, index=False)
display(imp)
print("Saved:", out_csv)
AAPL rows: 43 | features: 7 | unique quarters: 43
Baseline pooled OOS R^2 (all features): -1.6197
feature lofo_delta_oos_r2 perm_drop_in_test_r2 mean_abs_std_coef rank_lofo rank_perm rank_coef aggregate_rank
0 pre_vol_5d -0.360582 6.514350 0.023092 4.0 1.0 2.0 2.333333
1 eps_surprise_pct -0.155074 1.411248 0.021375 3.0 2.0 3.0 2.666667
2 pre_ret_3d 0.460483 -0.172005 0.025275 1.0 7.0 1.0 3.000000
3 mkt_ret_5d_lag1 0.238020 0.922156 0.020129 2.0 3.0 4.0 3.000000
4 vix_level_lag1 -0.430675 0.627245 0.017450 6.0 4.0 5.0 5.000000
5 vix_chg_5d_lag1 -0.385391 0.200592 0.014840 5.0 6.0 6.0 5.666667
6 macro_us10y -0.629730 0.219502 0.014301 7.0 5.0 7.0 6.333333
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\aapl_feature_importance_v12_window_0_5_testblock4.csv
In [3]:
# === Baseline v1 feature importance on CAR(0,5) ===
# Needs: pandas, numpy, scikit-learn, openpyxl, matplotlib (optional)

import re, numpy as np, pandas as pd
from pathlib import Path
from sklearn.model_selection import GroupKFold, KFold
from sklearn.linear_model import LinearRegression

# ---------- Paths (tries your Windows folder first, then /mnt/data) ----------
BASE_DIRS = [
    Path(r"C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model"),
    Path("/mnt/data"),
    Path(".")
]
EVENT_FILE = "event_study.xlsx"       # targets live here (CAR sheets)
FEATURES_FILE = "Baseline v1.xlsx"    # your baseline features
WINDOW = "0,5"                        # focus window

MAX_GROUP_FOLDS = 5
WINSOR_PCTS = (1, 99)
np.random.seed(42)

def find_file(name):
    for b in BASE_DIRS:
        p = b / name
        if p.exists(): return p
    raise FileNotFoundError(name)

def is_readme(name): 
    return bool(re.search(r"read\s*me|readme|notes?|about|info", str(name), re.I))

def window_sheets(book):
    out = {"0,1":None,"0,3":None,"0,5":None}
    pats = {"0,1":r"(car.*)?0\D*1(?!\d)","0,3":r"(car.*)?0\D*3(?!\d)","0,5":r"(car.*)?0\D*5(?!\d)"}
    for nm in book:
        if is_readme(nm): continue
        for w,pat in pats.items():
            if out[w] is None and re.search(pat, str(nm), re.I):
                out[w]=nm
    return out

def choose_features_sheet(book):
    cands = [(n, df) for n, df in book.items() if not is_readme(n)]
    if not cands: return next(iter(book))
    def score(item):
        _, df = item
        return (df.select_dtypes(include=[np.number]).shape[1], len(df))
    return max(cands, key=score)[0]

def find_day0(df):
    s=[c for c in df.columns if re.search(r"\bday[\s_]*0\b", str(c), re.I)]
    if s: return s[0]
    for c in ["event_date","EventDate","announcement_date","ANNOUNCEMENT_DATE",
              "date","Date","trading_date","TradingDate","date0","Date0","DATE0"]:
        if c in df.columns: return c
    # fallback: most date-like
    best,k=None,-1
    for c in df.columns:
        kk = pd.to_datetime(df[c], errors="coerce").notna().sum()
        if kk>k: best,k=c,kk
    return best

def find_ticker(df):
    for c in ["ticker","Ticker","symbol","Symbol","RIC","ric","ISIN","isin","CUSIP","cusip","SEDOL","sedol"]:
        if c in df.columns: return c
    obj = df.select_dtypes(include=["object"]).columns
    best,score=None,-1
    for c in obj:
        s=df[c].astype(str).str.strip()
        sc=s.nunique() - 0.1*s.str.len().mean()
        if sc>score: best,score=c,sc
    return best

def find_target(df):
    c=[c for c in df.columns if re.search(r"\bcar\b", str(c), re.I)]
    if c: return c[0]
    c=[c for c in df.columns if re.search(r"cumul.*abnorm.*return", str(c), re.I)]
    return c[0] if c else None

def norm_day0(s):
    a = pd.to_datetime(s, errors="coerce").dt.normalize()
    b = pd.to_datetime(s, errors="coerce", dayfirst=True).dt.normalize()
    return b.where(b.notna(), a)

def norm_tic(s):
    return s.astype(str).str.strip().str.upper()

def group_numeric(df, dcol, tcol):
    g=df.copy()
    g["__day0__"]=norm_day0(g[dcol]); g["__tic__"]=norm_tic(g[tcol])
    nums=g.select_dtypes(include=[np.number]).columns.tolist()
    g=(g.groupby(["__day0__","__tic__"], as_index=False)[nums].mean()
         .dropna(subset=["__day0__","__tic__"]))
    return g, nums

def build_X(merged, numeric_cols, ycol):
    keep=[c for c in numeric_cols if c in merged.columns]
    X=merged.loc[:, keep].drop(columns=[ycol], errors="ignore")
    nunq=X.nunique(dropna=False)
    return X.loc[:, nunq>1]

def adjusted_r2(n, p, r2):
    return np.nan if n-p-1<=0 else 1 - (1-r2)*(n-1)/(n-p-1)

# train-only winsor + standardise
def fit_transformers(Xtr, lo=1, hi=99):
    stats={}
    Xw=Xtr.copy()
    for c in Xw.columns:
        lo_v, hi_v = np.nanpercentile(Xw[c].values, [lo, hi])
        clamped = Xw[c].clip(lo_v, hi_v)
        mu = float(np.nanmean(clamped))
        sd = float(np.nanstd(clamped, ddof=0)) or 1.0
        stats[c] = {"lo": float(lo_v), "hi": float(hi_v), "mu": mu, "sd": sd}
        Xw[c] = (clamped - mu) / sd
    return stats, Xw

def apply_transformers(Xte, stats):
    Xw=Xte.copy()
    for c in Xw.columns:
        if c not in stats: continue
        lo_v, hi_v, mu, sd = stats[c]["lo"], stats[c]["hi"], stats[c]["mu"], stats[c]["sd"]
        Xw[c] = (Xw[c].clip(lo_v, hi_v) - mu) / sd
    return Xw

def fold_score_and_coefs(Xtr, ytr, Xte, yte):
    stats, Xtr_s = fit_transformers(Xtr, lo=WINSOR_PCTS[0], hi=WINSOR_PCTS[1])
    Xte_s = apply_transformers(Xte, stats)
    m = LinearRegression().fit(Xtr_s.values, ytr.values)
    yh = m.predict(Xte_s.values)
    ss_res = np.sum((yte.values - yh)**2)
    ss_tot = np.sum((yte.values - yte.values.mean())**2)
    r2 = (1 - ss_res/ss_tot) if ss_tot>0 else np.nan
    return r2, m.coef_

def grouped_splits(X, y, groups, max_folds=5):
    ng=int(pd.Series(groups).nunique())
    if ng>=2:
        return list(GroupKFold(n_splits=min(max_folds, ng)).split(X, y, groups))
    return list(KFold(n_splits=min(3,len(X)), shuffle=True, random_state=42).split(X, y))

# ---------- Load data ----------
evt_book = pd.read_excel(find_file(EVENT_FILE), sheet_name=None, engine="openpyxl")
win_map = window_sheets(evt_book)
sheet = win_map.get(WINDOW)
assert sheet is not None, f"Could not find CAR sheet for window {WINDOW}"

ev = evt_book[sheet].copy()
ed, et, ycol = find_day0(ev), find_ticker(ev), find_target(ev)
ev["__day0__"]=norm_day0(ev[ed]); ev["__tic__"]=norm_tic(ev[et])
ev = ev.dropna(subset=["__day0__","__tic__", ycol]).drop_duplicates(subset=["__day0__","__tic__"])

feat_book = pd.read_excel(find_file(FEATURES_FILE), sheet_name=None, engine="openpyxl")
fsheet = choose_features_sheet(feat_book)
raw = feat_book[fsheet].copy()
dcol, tcol = find_day0(raw), find_ticker(raw)
feat_g, numeric_cols = group_numeric(raw, dcol, tcol)

merged = feat_g.merge(ev[["__day0__","__tic__", ycol]],
                      on=["__day0__","__tic__"], how="inner").copy()

# Build design
X = build_X(merged, numeric_cols, ycol)
y = merged[ycol].astype(float)
groups = merged["__tic__"]

# Baseline CV R^2 with all features
splits = grouped_splits(X, y, groups, MAX_GROUP_FOLDS)
base_scores, coef_abs = [], {f: [] for f in X.columns}
for tr, te in splits:
    r2, coef = fold_score_and_coefs(X.iloc[tr], y.iloc[tr], X.iloc[te], y.iloc[te])
    base_scores.append(r2)
    for f,c in zip(X.columns, coef):
        coef_abs[f].append(abs(float(c)))
base_cv_r2 = float(np.nanmean(base_scores)) if base_scores else np.nan
coef_mean = {f: float(np.nanmean(v)) for f,v in coef_abs.items()}

# LOFO Δ CV-R^2
lofo = {}
for f in X.columns:
    scores=[]
    Xm = X.drop(columns=[f])
    for tr, te in splits:
        r2, _ = fold_score_and_coefs(Xm.iloc[tr], y.iloc[tr], Xm.iloc[te], y.iloc[te])
        scores.append(r2)
    cv_without = float(np.nanmean(scores)) if scores else np.nan
    lofo[f] = base_cv_r2 - cv_without   # + = helpful; - = harmful

# Permutation drop (average over folds)
perm = {f: [] for f in X.columns}
for tr, te in splits:
    r2_base, _ = fold_score_and_coefs(X.iloc[tr], y.iloc[tr], X.iloc[te], y.iloc[te])
    if not np.isfinite(r2_base): 
        for f in X.columns: perm[f].append(np.nan)
        continue
    Xte = X.iloc[te].copy()
    for f in X.columns:
        Xp = Xte.copy()
        Xp[f] = np.random.permutation(Xp[f].values)
        r2_perm, _ = fold_score_and_coefs(X.iloc[tr], y.iloc[tr], Xp, y.iloc[te])
        perm[f].append(r2_base - r2_perm if np.isfinite(r2_perm) else np.nan)
perm_mean = {f: float(np.nanmean(v)) for f,v in perm.items()}

# Importance table
imp = pd.DataFrame({
    "feature": list(X.columns),
    "lofo_delta_cv_r2": [lofo[f] for f in X.columns],
    "perm_drop_in_test_r2": [perm_mean[f] for f in X.columns],
    "mean_abs_std_coef": [coef_mean[f] for f in X.columns],
})
imp["rank_lofo"] = imp["lofo_delta_cv_r2"].rank(ascending=False, method="min")
imp["rank_perm"] = imp["perm_drop_in_test_r2"].rank(ascending=False, method="min")
imp["rank_coef"] = imp["mean_abs_std_coef"].rank(ascending=False, method="min")
imp["aggregate_rank"] = imp[["rank_lofo","rank_perm","rank_coef"]].mean(axis=1)
imp = imp.sort_values("aggregate_rank").reset_index(drop=True)

# Label for action
imp["action"] = np.where(imp["lofo_delta_cv_r2"] < 0,
                         "candidate_to_drop",
                         "keep_or_review")

print(f"Rows used: {len(X)} | Features used: {X.shape[1]}")
print(f"Baseline CV R^2 (all features): {base_cv_r2:.4f}")
display(imp.head(12))

# Save
out_path = find_file(EVENT_FILE).parent / "baseline_v1_feature_importance_window_0_5.csv"
imp.to_csv(out_path, index=False)
print("Saved:", out_path)

# Quick keep/drop shortlists
keep = imp.sort_values(["lofo_delta_cv_r2","perm_drop_in_test_r2"], ascending=False).head(10)[["feature","lofo_delta_cv_r2","perm_drop_in_test_r2"]]
drop = imp.sort_values(["lofo_delta_cv_r2","perm_drop_in_test_r2"], ascending=[True, True]).head(10)[["feature","lofo_delta_cv_r2","perm_drop_in_test_r2"]]
print("\nTop KEEP candidates:")
display(keep)
print("\nTop DROP candidates:")
display(drop)
Rows used: 129 | Features used: 16
Baseline CV R^2 (all features): -0.0331
feature lofo_delta_cv_r2 perm_drop_in_test_r2 mean_abs_std_coef rank_lofo rank_perm rank_coef aggregate_rank action
0 pre_ret_3d 0.108232 0.173534 0.019858 1.0 1.0 2.0 1.333333 keep_or_review
1 eps_surprise_pct 0.107472 0.143380 0.018775 2.0 4.0 3.0 3.000000 keep_or_review
2 vix_level_lag1 0.035916 0.150289 0.018590 3.0 3.0 4.0 3.333333 keep_or_review
3 macro_us10y 0.019432 0.136189 0.022850 4.0 5.0 1.0 3.333333 keep_or_review
4 mkt_ret_5d_lag1 0.002695 0.129564 0.013140 6.0 6.0 8.0 6.666667 keep_or_review
5 mkt_ret_10d_lag1 -0.033044 0.153363 0.018457 13.0 2.0 5.0 6.666667 candidate_to_drop
6 pre_ret_5d -0.032872 0.101560 0.015816 12.0 7.0 6.0 8.333333 candidate_to_drop
7 vix_chg_10d_lag1 0.006345 0.082645 0.010040 5.0 8.0 12.0 8.333333 keep_or_review
8 vix_chg_5d_lag1 -0.005810 0.058900 0.010042 9.0 9.0 11.0 9.666667 candidate_to_drop
9 macro_cpi_yoy -0.023060 0.046462 0.011607 10.0 11.0 9.0 10.000000 candidate_to_drop
10 pre_vol_10d -0.034472 0.058481 0.010715 14.0 10.0 10.0 11.333333 candidate_to_drop
11 macro_fedfunds -0.000884 0.018450 0.009147 8.0 13.0 13.0 11.333333 candidate_to_drop
Saved: C:\Users\dcazo\OneDrive\Documents\Data Analysis\LSE\4. Course 4\3. Model\baseline_v1_feature_importance_window_0_5.csv

Top KEEP candidates:
feature lofo_delta_cv_r2 perm_drop_in_test_r2
0 pre_ret_3d 0.108232 0.173534
1 eps_surprise_pct 0.107472 0.143380
2 vix_level_lag1 0.035916 0.150289
3 macro_us10y 0.019432 0.136189
7 vix_chg_10d_lag1 0.006345 0.082645
4 mkt_ret_5d_lag1 0.002695 0.129564
14 pre_vol_5d -0.000318 -0.012311
11 macro_fedfunds -0.000884 0.018450
8 vix_chg_5d_lag1 -0.005810 0.058900
9 macro_cpi_yoy -0.023060 0.046462
Top DROP candidates:
feature lofo_delta_cv_r2 perm_drop_in_test_r2
12 pre_ret_10d -0.045927 0.036502
15 mkt_ret_1d_lag1 -0.040140 0.009476
10 pre_vol_10d -0.034472 0.058481
5 mkt_ret_10d_lag1 -0.033044 0.153363
6 pre_ret_5d -0.032872 0.101560
13 pre_vol_3d -0.030640 0.015875
9 macro_cpi_yoy -0.023060 0.046462
8 vix_chg_5d_lag1 -0.005810 0.058900
11 macro_fedfunds -0.000884 0.018450
14 pre_vol_5d -0.000318 -0.012311
In [ ]:
 
In [8]:
import os
from typing import Dict, List, Optional, Tuple

import numpy as np
import pandas as pd
import requests
from dotenv import load_dotenv
from pandas_datareader import data as web
import yfinance as yf

# ================== CONFIG & ENV ================== #

load_dotenv()

FRED_API_KEY = os.getenv("FRED_API_KEY")
ALPHAVANTAGE_API_KEY = os.getenv("ALPHAVANTAGE_API_KEY")

# Tickers for the strategy
TICKERS: List[str] = ["AAPL", "NVDA", "GOOGL"]

# Backtest date range
BACKTEST_START = "2000-01-01"
BACKTEST_END: Optional[str] = None  # None = today

# Event study settings
# IMPORTANT: 0..5 means 6 daily returns: r0..r5, i.e. from C[-1] -> C[5]
EVENT_WINDOW = (0, 5)              # returns at day0..day5
ESTIMATION_LOOKBACK = 120          # -120..-20 trading days
ESTIMATION_GAP = 20                # gap from day0 back to end of estimation
WINSOR_P = 0.01                    # 1% tails for returns


# ================== GENERIC HELPERS ================== #

def get_date_range() -> Tuple[str, str]:
    start_dt = pd.to_datetime(BACKTEST_START)
    if BACKTEST_END is None:
        end_dt = pd.Timestamp.today().normalize()
    else:
        end_dt = pd.to_datetime(BACKTEST_END).normalize()
    return start_dt.strftime("%Y-%m-%d"), end_dt.strftime("%Y-%m-%d")


def winsorize_series(s: pd.Series, p: float) -> pd.Series:
    if s.empty:
        return s
    lower = s.quantile(p)
    upper = s.quantile(1.0 - p)
    return s.clip(lower, upper)


# ================== PRICES VIA YFINANCE + LOCAL CSV ================== #

def fetch_prices_yf(symbol: str, start: str, end: str) -> Optional[pd.DataFrame]:
    """
    Pull daily OHLCV from Yahoo via yfinance.
    """
    try:
        df = yf.download(symbol, start=start, end=end, auto_adjust=False, progress=False)
    except Exception as e:
        print(f"{symbol}: yfinance price download failed: {e}")
        return None

    if df is None or df.empty:
        print(f"{symbol}: yfinance returned no price data.")
        return None

    df = df.copy()
    df.index = pd.to_datetime(df.index)
    df.index.name = "date"
    df = df.sort_index()

    df = df.rename(
        columns={
            "Open": "open",
            "High": "high",
            "Low": "low",
            "Close": "close",
            "Adj Close": "adj_close",
            "Volume": "volume",
        }
    )

    for col in ["open", "high", "low", "close", "adj_close", "volume"]:
        if col not in df.columns:
            df[col] = np.nan

    return df[["open", "high", "low", "close", "adj_close", "volume"]]


def get_prices_with_fallback(symbol: str, start: str, end: str) -> Optional[pd.DataFrame]:
    """
    Try prices in this order:
      1) yfinance (Yahoo)
      2) Local CSV fallback: {SYMBOL}.csv
      3) Legacy fallback: {SYMBOL}_from_prices_clean.csv
    """
    # 1) Try online from yfinance
    df = fetch_prices_yf(symbol, start, end)
    if df is not None and not df.empty:
        print(f"{symbol}: got {len(df)} daily rows from yfinance")
        return df

    # 2–3) Try local files
    candidate_files = [
        f"{symbol}.csv",
        f"{symbol}_from_prices_clean.csv",
    ]

    for csv_path in candidate_files:
        if csv_path and os.path.exists(csv_path):
            print(f"{symbol}: using local price file {csv_path} as fallback.")
            df_local = pd.read_csv(csv_path)

            if "date" in df_local.columns:
                df_local["date"] = pd.to_datetime(df_local["date"])
                df_local = df_local.sort_values("date").set_index("date")
            else:
                df_local.index = pd.to_datetime(df_local.index)
                df_local = df_local.sort_index()

            # ensure adj_close exists
            if "adj_close" not in df_local.columns:
                if "close" in df_local.columns:
                    df_local["adj_close"] = df_local["close"]
                else:
                    df_local["adj_close"] = np.nan

            # filter to requested date range
            mask = (df_local.index >= pd.to_datetime(start)) & (df_local.index <= pd.to_datetime(end))
            df_local = df_local.loc[mask]

            for col in ["open", "high", "low", "close", "adj_close", "volume"]:
                if col not in df_local.columns:
                    df_local[col] = np.nan

            df_local.index.name = "date"
            return df_local[["open", "high", "low", "close", "adj_close", "volume"]]

    print(f"{symbol}: FAILED to get prices from yfinance and no local CSV found.")
    return None

def download_all_prices(start: str, end: str) -> Dict[str, pd.DataFrame]:
    """
    Download prices for all tickers and save per-ticker CSVs:

        AAPL.csv, NVDA.csv, GOOGL.csv

    Each file has columns:
        date, open, high, low, close, adj_close, volume

    Also returns a dict of per-ticker DataFrames indexed by date.
    """
    px_raw: Dict[str, pd.DataFrame] = {}

    for sym in TICKERS:
        df = get_prices_with_fallback(sym, start, end)
        if df is None or df.empty:
            print(f"{sym}: no price data available at all.")
            continue

        df = df.copy()
        # make sure we have a DatetimeIndex named 'date'
        if not isinstance(df.index, pd.DatetimeIndex):
            if "date" in df.columns:
                df["date"] = pd.to_datetime(df["date"])
                df = df.set_index("date")
            else:
                df.index = pd.to_datetime(df.index)

        df.index.name = "date"
        df = df.sort_index()

        # store in memory for the rest of the pipeline
        px_raw[sym] = df[["open", "high", "low", "close", "adj_close", "volume"]]

        # write per-ticker CSV with a date column, not index
        out = df.reset_index()
        out_path = f"{sym}.csv"
        try:
            out.to_csv(out_path, index=False)
            print(f"Saved {out_path} with {len(out)} rows.")
        except PermissionError:
            alt = f"{sym}_new.csv"
            out.to_csv(alt, index=False)
            print(
                f"Could not overwrite {out_path} (maybe open in Excel). "
                f"Saved prices to {alt} instead."
            )

    if not px_raw:
        print("No prices downloaded – no per-ticker CSVs written.")

    return px_raw


# ================== FAMA–FRENCH 3 FACTORS (DAILY, FLAT) ================== #

def fetch_ff_factors(start: str, end: str) -> pd.DataFrame:
    """
    Fetch daily Fama–French 3 factors, flatten to a DataFrame with a 'date' column,
    and write ff_factors_daily.csv.
    """
    print("Fetching Fama–French 3 factors (daily)...")
    start_dt = pd.to_datetime(start)
    end_dt = pd.to_datetime(end)

    ff3 = web.DataReader("F-F_Research_Data_Factors_Daily", "famafrench", start_dt)[0]

    ff3 = ff3.copy()
    ff3.index = pd.to_datetime(ff3.index)

    ff3 = ff3[(ff3.index >= start_dt) & (ff3.index <= end_dt)]

    df = ff3.rename(
        columns={
            "Mkt-RF": "Mkt_RF",
            "SMB": "SMB",
            "HML": "HML",
            "RF": "RF",
        }
    )

    df = df.reset_index()
    date_col = df.columns[0]
    df = df.rename(columns={date_col: "date"})
    df["date"] = pd.to_datetime(df["date"])

    for col in ["Mkt_RF", "SMB", "HML", "RF"]:
        df[col] = df[col] / 100.0

    df_out = df[["date", "Mkt_RF", "SMB", "HML", "RF"]].copy()
    df_out.to_csv("ff_factors_daily.csv", index=False)
    print("Saved ff_factors_daily.csv")

    return df_out


# ================== EARNINGS: ALPHA VANTAGE ONLY ================== #

def fetch_earnings_alpha_vantage(symbol: str, start_dt: pd.Timestamp, end_dt: pd.Timestamp) -> pd.DataFrame:
    """
    Reported EPS and estimate from Alpha Vantage EARNINGS endpoint.
    """
    if not ALPHAVANTAGE_API_KEY:
        print("ALPHAVANTAGE_API_KEY not set – no EPS.")
        return pd.DataFrame()

    url = "https://www.alphavantage.co/query"
    params = {
        "function": "EARNINGS",
        "symbol": symbol,
        "apikey": ALPHAVANTAGE_API_KEY,
    }
    try:
        r = requests.get(url, params=params, timeout=20)
        r.raise_for_status()
        data = r.json()
    except Exception as e:
        print(f"{symbol}: Alpha Vantage earnings failed: {e}")
        return pd.DataFrame()

    q = data.get("quarterlyEarnings", [])
    if not q:
        print(f"{symbol}: Alpha Vantage returned no quarterlyEarnings.")
        return pd.DataFrame()

    rows = []
    for item in q:
        d_str = item.get("reportedDate") or item.get("fiscalDateEnding")
        if not d_str:
            continue
        ad = pd.to_datetime(d_str).normalize()

        if ad < start_dt or ad > end_dt:
            continue

        rep = item.get("reportedEPS")
        est = item.get("estimatedEPS")
        if rep is None or est is None:
            continue

        try:
            eps_actual = float(rep)
            eps_est_val = float(est)
        except Exception:
            continue

        rows.append(
            {
                "ticker": symbol,
                "announce_date": ad,
                "eps_actual": eps_actual,
                "eps_est": eps_est_val,
            }
        )

    if not rows:
        print(f"{symbol}: Alpha Vantage had no usable EPS rows.")
        return pd.DataFrame()

    df = pd.DataFrame(rows)
    df["ticker"] = symbol
    return df


def combine_all_eps_sources(start_dt: pd.Timestamp, end_dt: pd.Timestamp) -> pd.DataFrame:
    """
    Try to get EPS online first.
    If no online EPS is found for ANY ticker, fall back to local eps_master.csv.

    ALWAYS:
      - return a DataFrame with columns:
          ticker, announce_date, eps_actual, eps_est, n_sources
      - write a working copy to eventearnings.csv

    eps_master.csv is treated as your "master" backup:
      - If it already exists, we NEVER overwrite it.
      - If it does NOT exist and we DO have online data, we create it once.
    """
    all_rows: List[pd.DataFrame] = []
    backup_path = "eps_master.csv"

    # ---------- 1) TRY ONLINE EPS (Alpha Vantage) ----------
    for sym in TICKERS:
        print(f"\nFetching EPS for {sym} from Alpha Vantage...")
        av_df = fetch_earnings_alpha_vantage(sym, start_dt, end_dt)
        if av_df is not None and not av_df.empty:
            all_rows.append(av_df)

    # ---------- 2) NO ONLINE DATA → FALL BACK TO eps_master.csv ----------
    if not all_rows:
        if os.path.exists(backup_path):
            print("No EPS from Alpha Vantage – using local eps_master.csv backup.")
            backup = pd.read_csv(backup_path)

            if backup.empty:
                print("Backup eps_master.csv is empty.")
                cols = ["ticker", "announce_date", "eps_actual", "eps_est", "n_sources"]
                empty = pd.DataFrame(columns=cols)
                empty.to_csv("eventearnings.csv", index=False)
                print("Saved empty eventearnings.csv.")
                return empty

            needed_cols = {"ticker", "announce_date", "eps_actual", "eps_est"}
            missing = needed_cols.difference(backup.columns)
            if missing:
                print(f"Backup eps_master.csv is missing columns {missing}.")
                cols = ["ticker", "announce_date", "eps_actual", "eps_est", "n_sources"]
                empty = pd.DataFrame(columns=cols)
                empty.to_csv("eventearnings.csv", index=False)
                print("Saved empty eventearnings.csv.")
                return empty

            # Clean and filter to backtest range
            backup = backup.copy()
            backup["ticker"] = backup["ticker"].astype(str).str.upper()
            backup["announce_date"] = pd.to_datetime(backup["announce_date"]).dt.normalize()

            mask = (backup["announce_date"] >= start_dt) & (backup["announce_date"] <= end_dt)
            master = backup.loc[mask].copy()

            if "n_sources" not in master.columns:
                master["n_sources"] = 1

            master = master.sort_values(["ticker", "announce_date"]).reset_index(drop=True)

            # IMPORTANT: only write WORKING COPY
            master.to_csv("eventearnings.csv", index=False)
            print(f"Using {len(master)} EPS rows from local backup. Saved eventearnings.csv.")
            return master

        # No online data and no backup file
        print("No EPS from Alpha Vantage and no eps_master.csv backup – creating empty tables.")
        cols = ["ticker", "announce_date", "eps_actual", "eps_est", "n_sources"]
        empty = pd.DataFrame(columns=cols)
        empty.to_csv("eventearnings.csv", index=False)
        print("Saved empty eventearnings.csv.")
        return empty

    # ---------- 3) WE HAVE ONLINE DATA → BUILD MASTER FROM IT ----------
    eps_all = pd.concat(all_rows, ignore_index=True)
    eps_all["ticker"] = eps_all["ticker"].astype(str).str.upper()
    eps_all["announce_date"] = pd.to_datetime(eps_all["announce_date"]).dt.normalize()

    eps_all = eps_all.sort_values(["ticker", "announce_date"])
    eps_all = eps_all.drop_duplicates(subset=["ticker", "announce_date"], keep="last")

    if "n_sources" not in eps_all.columns:
        eps_all["n_sources"] = 1

    master = eps_all[["ticker", "announce_date", "eps_actual", "eps_est", "n_sources"]].copy()
    master = master.sort_values(["ticker", "announce_date"]).reset_index(drop=True)

    # ALWAYS write the events file as a COPY of eps_master
    master.to_csv("eventearnings.csv", index=False)
    print(f"Saved eventearnings.csv with {len(master)} rows from Alpha Vantage.")

    # Only create eps_master.csv automatically if it does NOT exist yet
    if not os.path.exists(backup_path):
        master.to_csv(backup_path, index=False)
        print("No existing eps_master.csv found, so saved a new master from online data.")
    else:
        print("Existing eps_master.csv detected – leaving it untouched.")

    return master

    # ---------- 3) WE HAVE ONLINE DATA → CLEAN AND SAVE ----------
    eps_all = pd.concat(all_rows, ignore_index=True)
    eps_all["ticker"] = eps_all["ticker"].astype(str).str.upper()
    eps_all["announce_date"] = pd.to_datetime(eps_all["announce_date"]).dt.normalize()

    eps_all = eps_all.sort_values(["ticker", "announce_date"])
    eps_all = eps_all.drop_duplicates(subset=["ticker", "announce_date"], keep="last")

    eps_all["n_sources"] = 1

    master = eps_all[["ticker", "announce_date", "eps_actual", "eps_est", "n_sources"]].copy()
    master = master.sort_values(["ticker", "announce_date"]).reset_index(drop=True)

    master.to_csv("eps_master.csv", index=False)
    print(f"Saved eps_master.csv with {len(master)} rows (Alpha Vantage).")
    return master

# ================== FEATURES TABLE (eps_surprise_pct, pre_ret_3d) ================== #

def build_features_table(
    eps_master: pd.DataFrame, px_raw: Dict[str, pd.DataFrame]
) -> pd.DataFrame:
    """
    Build features_model.csv with:
      - eps_surprise_pct = (eps_actual - eps_est) / |eps_est|
      - pre_ret_3d = Price(D-1) / Price(D-4) - 1
    Day0 = first trading day AFTER announce_date (AMC).
    """

    records: List[dict] = []

    if eps_master.empty:
        features_df = pd.DataFrame(
            columns=[
                "ticker",
                "announce_date",
                "day0",
                "eps_actual",
                "eps_est",
                "eps_surprise_pct",
                "pre_ret_3d",
                "n_sources",
            ]
        )
        try:
            features_df.to_csv("features_model.csv", index=False)
            print("Saved empty features_model.csv (no EPS events).")
        except PermissionError:
            alt = "features_model_new.csv"
            features_df.to_csv(alt, index=False)
            print(
                f"Could not overwrite features_model.csv (maybe open in Excel). "
                f"Saved empty features to {alt} instead."
            )
        return features_df

    eps_df = eps_master.copy()
    eps_df["ticker"] = eps_df["ticker"].astype(str).str.upper()
    eps_df["announce_date"] = pd.to_datetime(eps_df["announce_date"]).dt.normalize()

    for _, row in eps_df.iterrows():
        sym = row["ticker"]
        px = px_raw.get(sym)
        if px is None or px.empty:
            continue

        idx = px.index
        announce_date = pd.to_datetime(row["announce_date"]).normalize()

        future_dates = idx[idx > announce_date]
        if len(future_dates) == 0:
            continue
        day0 = future_dates[0]
        loc0 = idx.get_loc(day0)

        if loc0 < 4:
            continue
        loc_minus1 = loc0 - 1
        loc_minus4 = loc0 - 4

        price_minus1 = float(px["adj_close"].iloc[loc_minus1])
        price_minus4 = float(px["adj_close"].iloc[loc_minus4])
        if price_minus4 == 0.0:
            continue

        pre_ret_3d = price_minus1 / price_minus4 - 1.0

        eps_actual = float(row["eps_actual"])
        eps_est = float(row["eps_est"])
        if eps_est == 0:
            continue

        eps_surprise_pct = (eps_actual - eps_est) / abs(eps_est)

        records.append(
            {
                "ticker": sym,
                "announce_date": announce_date,
                "day0": day0,
                "eps_actual": eps_actual,
                "eps_est": eps_est,
                "eps_surprise_pct": eps_surprise_pct,
                "pre_ret_3d": pre_ret_3d,
                "n_sources": int(row.get("n_sources", 1)),
            }
        )

    if not records:
        features_df = pd.DataFrame(
            columns=[
                "ticker",
                "announce_date",
                "day0",
                "eps_actual",
                "eps_est",
                "eps_surprise_pct",
                "pre_ret_3d",
                "n_sources",
            ]
        )
    else:
        features_df = pd.DataFrame(records)
        features_df = features_df.sort_values(["ticker", "day0"]).reset_index(drop=True)

    try:
        features_df.to_csv("features_model.csv", index=False)
        print(f"Saved features_model.csv with {len(features_df)} rows.")
    except PermissionError:
        alt = "features_model_new.csv"
        features_df.to_csv(alt, index=False)
        print(
            f"Could not overwrite features_model.csv (maybe open in Excel). "
            f"Saved features to {alt} instead."
        )

    return features_df


# ================== EVENT STUDY: CAR(0,5) WITH CORRECT 0–5 LOGIC ================== #

def build_event_study(
    features_df: pd.DataFrame, px_raw: Dict[str, pd.DataFrame], ff_factors: pd.DataFrame
) -> pd.DataFrame:
    """
    For each event (ticker, day0), compute CAR over 0..5 using FF3:

      - Daily returns are:
            r_t = AdjClose_t / AdjClose_{t-1} - 1
      - Event window CAR(0,5) sums AR_0..AR_5:
            6 daily abnormal returns (day0..day5),
            which correspond to price move from day-1 close to day5 close.

      - Estimation window: -120..-20 relative to day0
      - Model: (ret - RF) ~ 1 + Mkt_RF + SMB + HML
      - ret = winsorised daily return from adj_close
    """
    if features_df.empty:
        event_df = pd.DataFrame(
            columns=[
                "ticker",
                "announce_date",
                "day0",
                "event_start",
                "event_end",
                "est_start",
                "est_end",
                "CAR_0_5",
            ]
        )
        event_df.to_csv("event_study_car_0_5.csv", index=False)
        print("Saved empty event_study_car_0_5.csv (no features).")
        return event_df

    ff = ff_factors.copy()
    ff["date"] = pd.to_datetime(ff["date"])
    ff = ff.sort_values("date").set_index("date")

    records: List[dict] = []

    for sym in TICKERS:
        px = px_raw.get(sym)
        if px is None or px.empty:
            continue

        df_px = px.copy()
        if not isinstance(df_px.index, pd.DatetimeIndex):
            if "date" in df_px.columns:
                df_px["date"] = pd.to_datetime(df_px["date"])
                df_px = df_px.set_index("date")
            else:
                df_px.index = pd.to_datetime(df_px.index)
        df_px = df_px.sort_index()

        common_dates = df_px.index.intersection(ff.index)
        common_dates = common_dates.sort_values()
        if len(common_dates) == 0:
            continue

        merged = pd.DataFrame(index=common_dates)
        merged["adj_close"] = df_px.loc[common_dates, "adj_close"].values
        merged["Mkt_RF"] = ff.loc[common_dates, "Mkt_RF"].values
        merged["SMB"] = ff.loc[common_dates, "SMB"].values
        merged["HML"] = ff.loc[common_dates, "HML"].values
        merged["RF"] = ff.loc[common_dates, "RF"].values
        merged["date"] = merged.index

        # daily returns (ret_t is associated with that day's close vs previous close)
        merged["ret_raw"] = merged["adj_close"].pct_change()
        merged["ret"] = winsorize_series(merged["ret_raw"], WINSOR_P)

        idx = merged.index

        ev_rows = features_df[features_df["ticker"] == sym]
        if ev_rows.empty:
            continue

        needed = ["ret", "Mkt_RF", "SMB", "HML", "RF"]

        for _, ev in ev_rows.iterrows():
            day0 = pd.to_datetime(ev["day0"])
            loc_candidates = np.where(idx == np.datetime64(day0))[0]
            if len(loc_candidates) == 0:
                continue
            loc0 = int(loc_candidates[0])

            # -------- EVENT WINDOW: 0..5 --------
            # we need valid returns at indices loc0..loc0+5
            # ret at index 0 is NaN because there is no previous day
            event_start_loc = loc0 + EVENT_WINDOW[0]  # should be loc0
            event_end_loc = loc0 + EVENT_WINDOW[1]    # loc0+5

            if event_start_loc < 1:  # need a previous day for ret at day0
                continue
            if event_end_loc >= len(merged):
                continue

            # -------- ESTIMATION WINDOW: -120..-20 --------
            est_end = loc0 - ESTIMATION_GAP
            est_start = est_end - ESTIMATION_LOOKBACK + 1
            if est_start < 1 or est_end >= len(merged):
                continue

            est = merged.iloc[est_start: est_end + 1].copy()
            if est[needed].isna().any().any():
                continue

            # Fit FF3 model on estimation window
            y = est["ret"] - est["RF"]
            X = np.column_stack(
                [
                    np.ones(len(est)),
                    est["Mkt_RF"],
                    est["SMB"],
                    est["HML"],
                ]
            )
            beta_hat, *_ = np.linalg.lstsq(X, y.values, rcond=None)

            # Event window rows: day0..day5
            ev_df = merged.iloc[event_start_loc: event_end_loc + 1].copy()
            if ev_df[needed].isna().any().any():
                continue

            X_ev = np.column_stack(
                [
                    np.ones(len(ev_df)),
                    ev_df["Mkt_RF"],
                    ev_df["SMB"],
                    ev_df["HML"],
                ]
            )
            excess_hat = X_ev @ beta_hat
            exp_ret = ev_df["RF"].values + excess_hat

            # abnormal returns for days 0..5
            abn = ev_df["ret"].values - exp_ret
            car_0_5 = float(abn.sum())

            records.append(
                {
                    "ticker": sym,
                    "announce_date": ev["announce_date"],
                    "day0": day0,
                    # these are the dates for day0 .. day5
                    "event_start": ev_df["date"].iloc[0],
                    "event_end": ev_df["date"].iloc[-1],
                    # estimation window dates (for debugging / trust)
                    "est_start": est["date"].iloc[0],
                    "est_end": est["date"].iloc[-1],
                    "CAR_0_5": car_0_5,
                }
            )

    if not records:
        event_df = pd.DataFrame(
            columns=[
                "ticker",
                "announce_date",
                "day0",
                "event_start",
                "event_end",
                "est_start",
                "est_end",
                "CAR_0_5",
            ]
        )
    else:
        event_df = pd.DataFrame(records)
        event_df = event_df.sort_values(["ticker", "day0"]).reset_index(drop=True)

    event_df.to_csv("event_study_car_0_5.csv", index=False)
    print(f"Saved event_study_car_0_5.csv with {len(event_df)} rows.")

    return event_df


# ================== MACRO CALENDAR (USE YOUR CLEAN CSV IF PRESENT) ================== #

def fetch_macro_calendar(start: str, end: str) -> pd.DataFrame:
    """
    Macro calendar logic:

    1) If macro_calendar_clean.csv exists:
         - Read it.
         - Parse dates with dayfirst=True.
         - Filter to [start, end].
         - Re-save as macro_calendar.csv with:
              date in dd/mm/YYYY
              event_type (CPI, FOMC, etc).

    2) If macro_calendar_clean.csv does not exist:
         - Try to build a CPI-only calendar from FRED.
         - Label all rows as event_type = "CPI".
         - Save as macro_calendar.csv.

    No fake “FOMC every day” rubbish.
    """
    start_dt = pd.to_datetime(start)
    end_dt = pd.to_datetime(end)

    clean_path = "macro_calendar_clean.csv"

    if os.path.exists(clean_path):
        print(f"Using local macro_calendar_clean.csv as macro source.")
        df = pd.read_csv(clean_path)
        if "date" not in df.columns or "event_type" not in df.columns:
            print("macro_calendar_clean.csv is missing 'date' or 'event_type' columns.")
            df_out = pd.DataFrame(columns=["date", "event_type"])
            df_out.to_csv("macro_calendar.csv", index=False)
            print("Saved empty macro_calendar.csv")
            return df_out

        df["date"] = pd.to_datetime(df["date"], dayfirst=True, errors="coerce")
        df = df.dropna(subset=["date"])
        df = df[(df["date"] >= start_dt) & (df["date"] <= end_dt)]

        if df.empty:
            df_out = pd.DataFrame(columns=["date", "event_type"])
        else:
            df_out = df[["date", "event_type"]].copy()
            df_out["date"] = df_out["date"].dt.strftime("%d/%m/%Y")

        try:
            df_out.to_csv("macro_calendar.csv", index=False)
            print(f"Saved macro_calendar.csv with {len(df_out)} rows (from macro_calendar_clean.csv).")
        except PermissionError:
            alt = "macro_calendar_new.csv"
            df_out.to_csv(alt, index=False)
            print(
                f"Could not overwrite macro_calendar.csv (maybe open in Excel). "
                f"Saved macro calendar to {alt} instead."
            )
        return df_out

    # Fallback: CPI-only from FRED (no FOMC)
    if not FRED_API_KEY:
        df_empty = pd.DataFrame(columns=["date", "event_type"])
        df_empty.to_csv("macro_calendar.csv", index=False)
        print("No macro_calendar_clean.csv and no FRED_API_KEY – saved empty macro_calendar.csv")
        return df_empty

    print("macro_calendar_clean.csv not found – building CPI-only calendar from FRED.")

    base = "https://api.stlouisfed.org/fred"
    common_params = {"api_key": FRED_API_KEY, "file_type": "json"}
    start_str = start_dt.strftime("%Y-%m-%d")
    end_str = end_dt.strftime("%Y-%m-%d")

    try:
        r = requests.get(base + "/releases", params=common_params, timeout=20)
        r.raise_for_status()
        rel_data = r.json()
        releases = rel_data.get("releases", [])
    except Exception as e:
        print(f"Error fetching FRED releases: {e}")
        df_empty = pd.DataFrame(columns=["date", "event_type"])
        df_empty.to_csv("macro_calendar.csv", index=False)
        return df_empty

    cpi_release_ids: List[int] = []
    for rel in releases:
        rid = rel.get("id")
        name = rel.get("name", "")
        if rid is None:
            continue
        nl = name.lower()
        if "consumer price index" in nl:
            cpi_release_ids.append(rid)

    records: List[dict] = []

    for rid in cpi_release_ids:
        params = {
            "api_key": FRED_API_KEY,
            "file_type": "json",
            "release_id": rid,
            "observation_start": start_str,
            "observation_end": end_str,
        }
        try:
            r2 = requests.get(base + "/release/dates", params=params, timeout=20)
            r2.raise_for_status()
            d2 = r2.json()
            for item in d2.get("release_dates", []):
                d_str = item.get("date")
                if not d_str:
                    continue
                ts = pd.to_datetime(d_str).normalize()
                records.append({"date": ts, "event_type": "CPI"})
        except Exception as e:
            print(f"Error fetching FRED dates for release {rid}: {e}")

    if not records:
        df = pd.DataFrame(columns=["date", "event_type"])
    else:
        df = pd.DataFrame(records)
        df = df.drop_duplicates().sort_values("date").reset_index(drop=True)

    if not df.empty:
        df["date"] = pd.to_datetime(df["date"]).dt.strftime("%d/%m/%Y")

    try:
        df.to_csv("macro_calendar.csv", index=False)
        print(f"Saved macro_calendar.csv with {len(df)} rows (CPI-only FRED fallback).")
    except PermissionError:
        alt = "macro_calendar_new.csv"
        df.to_csv(alt, index=False)
        print(
            f"Could not overwrite macro_calendar.csv (maybe open in Excel). "
            f"Saved macro calendar to {alt} instead."
        )

    return df


# ================== MAIN: PIPELINE ONLY (NO STRATEGY) ================== #

def main() -> None:
    start_str, end_str = get_date_range()
    print(f"Date range: {start_str} to {end_str}")

    print("\n--- Step 1: Download daily OHLCV prices ---")
    px_raw = download_all_prices(start_str, end_str)

    print("\n--- Step 2: Download Fama–French factors ---")
    ff_factors = fetch_ff_factors(start_str, end_str)

    print("\n--- Step 3: Download EPS from Alpha Vantage ---")
    start_dt = pd.to_datetime(start_str)
    end_dt = pd.to_datetime(end_str)
    eps_master = combine_all_eps_sources(start_dt, end_dt)

    print("\n--- Step 4: Build features table (eps_surprise_pct, pre_ret_3d) ---")
    features_df = build_features_table(eps_master, px_raw)

    print("\n--- Step 5: Build event study with CAR(0,5) ---")
    event_df = build_event_study(features_df, px_raw, ff_factors)

    print("\n--- Step 6: Build macro calendar (from macro_calendar_clean.csv if present) ---")
    macro_df = fetch_macro_calendar(start_str, end_str)

    print("\n--- Done (data pipeline only, no strategy yet) ---")
    print(f"Prices rows (all tickers): {sum(len(df) for df in px_raw.values())}")
    print(f"FF factors rows: {len(ff_factors)}")
    print(f"EPS master rows: {len(eps_master)}")
    print(f"Features rows: {len(features_df)}")
    print(f"Event study rows: {len(event_df)}")
    print(f"Macro calendar rows: {len(macro_df)}")


if __name__ == "__main__":
    main()
Date range: 2000-01-01 to 2025-11-20

--- Step 1: Download daily OHLCV prices ---
AAPL: got 6511 daily rows from yfinance
Saved AAPL.csv with 6511 rows.
NVDA: got 6511 daily rows from yfinance
Saved NVDA.csv with 6511 rows.
GOOGL: got 5349 daily rows from yfinance
Saved GOOGL.csv with 5349 rows.

--- Step 2: Download Fama–French factors ---
Fetching Fama–French 3 factors (daily)...
C:\Users\dcazo\AppData\Local\Temp\ipykernel_11148\3343574371.py:208: FutureWarning: The argument 'date_parser' is deprecated and will be removed in a future version. Please use 'date_format' instead, or read your data in as 'object' dtype and then call 'to_datetime'.
  ff3 = web.DataReader("F-F_Research_Data_Factors_Daily", "famafrench", start_dt)[0]
Saved ff_factors_daily.csv

--- Step 3: Download EPS from Alpha Vantage ---

Fetching EPS for AAPL from Alpha Vantage...
ALPHAVANTAGE_API_KEY not set – no EPS.

Fetching EPS for NVDA from Alpha Vantage...
ALPHAVANTAGE_API_KEY not set – no EPS.

Fetching EPS for GOOGL from Alpha Vantage...
ALPHAVANTAGE_API_KEY not set – no EPS.
No EPS from Alpha Vantage – using local eps_master.csv backup.
Using 282 EPS rows from local backup. Saved eventearnings.csv.

--- Step 4: Build features table (eps_surprise_pct, pre_ret_3d) ---
C:\Users\dcazo\AppData\Local\Temp\ipykernel_11148\3343574371.py:488: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  price_minus1 = float(px["adj_close"].iloc[loc_minus1])
C:\Users\dcazo\AppData\Local\Temp\ipykernel_11148\3343574371.py:489: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead
  price_minus4 = float(px["adj_close"].iloc[loc_minus4])
Saved features_model.csv with 282 rows.

--- Step 5: Build event study with CAR(0,5) ---
Saved event_study_car_0_5.csv with 273 rows.

--- Step 6: Build macro calendar (from macro_calendar_clean.csv if present) ---
macro_calendar_clean.csv not found – building CPI-only calendar from FRED.
Saved macro_calendar.csv with 940 rows (CPI-only FRED fallback).

--- Done (data pipeline only, no strategy yet) ---
Prices rows (all tickers): 18371
FF factors rows: 6475
EPS master rows: 282
Features rows: 282
Event study rows: 273
Macro calendar rows: 940
In [1]:
import pandas as pd
import numpy as np


# ======================= SETTINGS ======================= #

# Core data produced by auto_pipeline_and_backtest.py
FEATURES_PATH = "features_model.csv"         # eps_surprise_pct, pre_ret_3d, etc.
EVENT_STUDY_PATH = "event_study_car_0_5.csv"  # CAR_0_5 per event
MACRO_PATH = "macro_calendar.csv"           # macro dates to avoid (CPI / FOMC etc.)

PRICE_FILES = {
    "AAPL": "AAPL.csv",
    "NVDA": "NVDA.csv",
    "GOOGL": "GOOGL.csv",
}

# Features we use in the regression
FEATURE_COLS = ["eps_surprise_pct", "pre_ret_3d"]

# Training + trading universe
TRAIN_START_DATE = pd.Timestamp("2010-01-01")  # only train on events from here onwards
MIN_TRAIN_EVENTS = 80                          # minimum training events before we trade

# Transaction cost per leg per side (0.0005 = 0.05%)
COST_RATE = 0.0005

# Capital we pretend the fund allocates to this strategy
ASSUMED_CAPITAL = 10_000_000

# Avoid trading on macro dates (day0 in macro_calendar)
USE_MACRO_FILTER = True


# ======================= HELPERS ======================= #

def score_to_allocation_dollars(score: float) -> float:
    """
    Map the model score to a dollar position.
    Positive score -> long notional. Negative score -> short notional.
    Uses your preferred stepped ladder:

        |score| < 0.5      -> 0
        0.5 <= |score| < 0.7 -> 200k
        0.7 <= |score| < 0.9 -> 400k
        |score| >= 0.9       -> 600k
    """
    s = abs(score)
    if s < 0.3:
        alloc = 0.0
    elif s < 0.4:
        alloc = 200_000.0
    elif s < 0.5:
        alloc = 400_000.0
    else:
        alloc = 600_000.0

    return float(np.sign(score) * alloc)


def load_prices():
    """
    Load AAPL, NVDA, GOOGL daily prices from CSVs produced by the pipeline.

    Expected columns in each file:
      ticker, date, open, high, low, close, adj_close, volume
    """
    frames = []
    for ticker, path in PRICE_FILES.items():
        df = pd.read_csv(path)
        if "ticker" not in df.columns:
            df["ticker"] = ticker

        df["date"] = pd.to_datetime(df["date"]).dt.normalize()
        needed = ["ticker", "date", "open", "high", "low", "close", "adj_close", "volume"]
        missing = [c for c in needed if c not in df.columns]
        if missing:
            raise ValueError(f"{ticker}: missing columns {missing} in {path}")

        df = df[needed]
        frames.append(df)

    prices = pd.concat(frames, ignore_index=True)
    prices = prices.sort_values(["ticker", "date"])
    prices.set_index(["ticker", "date"], inplace=True)
    return prices


def load_macro_dates():
    """
    Load macro dates (CPI / FOMC etc.) and return a set of dates to avoid trading.
    """
    try:
        macro = pd.read_csv(MACRO_PATH)
    except FileNotFoundError:
        print("macro_calendar.csv not found – no macro filtering will be applied.")
        return set()

    if "date" not in macro.columns:
        print("macro_calendar.csv has no 'date' column – no macro filtering will be applied.")
        return set()

    macro["date"] = pd.to_datetime(macro["date"]).dt.normalize()
    unique_dates = set(macro["date"].unique())
    print(f"Loaded {len(unique_dates)} unique macro dates to avoid (day0 only).")
    return unique_dates


# ======================= MAIN BACKTEST ======================= #

def backtest_directional():
    # 1) Load features and event study
    features = pd.read_csv(FEATURES_PATH)
    events = pd.read_csv(EVENT_STUDY_PATH)

    # Ensure proper date types
    for df in (features, events):
        if "announce_date" in df.columns:
            df["announce_date"] = pd.to_datetime(df["announce_date"]).dt.normalize()
        if "day0" in df.columns:
            df["day0"] = pd.to_datetime(df["day0"]).dt.normalize()

    # Merge on ticker + dates
    merge_keys = ["ticker", "announce_date", "day0"]
    df = pd.merge(
        events,
        features,
        on=merge_keys,
        how="inner",
        suffixes=("", "_feat"),
    )

    # Keep only the columns we care about
    if "CAR_0_5" not in df.columns:
        raise ValueError("Expected 'CAR_0_5' column in event_study_car_0_5.csv")

    # Filter to training/trading universe (day0 >= TRAIN_START_DATE)
    df = df[df["day0"] >= TRAIN_START_DATE].copy()
    df = df.sort_values("day0").reset_index(drop=True)

    n_events = len(df)
    print(f"Total events in sample (day0 >= {TRAIN_START_DATE.date()}): {n_events}")
    if n_events == 0:
        print("No events after TRAIN_START_DATE – nothing to backtest.")
        return None

    # Drop rows with missing features
    df = df.dropna(subset=FEATURE_COLS + ["CAR_0_5"]).reset_index(drop=True)

    # 2) Load prices
    prices = load_prices()

    # 3) Macro dates
    macro_dates = load_macro_dates() if USE_MACRO_FILTER else set()

    records = []

    # 4) Walk forward through time (event by event)
    for i in range(len(df)):
        row = df.iloc[i]
        ticker = row["ticker"]
        day0 = row["day0"]
        announce_date = row["announce_date"]
        car_0_5 = float(row["CAR_0_5"])  # factor-adjusted CAR(0,5)
        x_i = row[FEATURE_COLS].values.astype(float)

        score = np.nan
        pos_dollars = 0.0
        pnl_dollars = 0.0
        raw_ret_0_5 = np.nan
        exit_date = pd.NaT
        skipped_macro = False

        # Build training sample: all prior events (by day0) after TRAIN_START_DATE
        train_mask = (df["day0"] < day0) & (df["day0"] >= TRAIN_START_DATE)
        train = df[train_mask]

        if len(train) >= MIN_TRAIN_EVENTS:
            X_train = train[FEATURE_COLS].values.astype(float)
            y_train = train["CAR_0_5"].values.astype(float)

            # Linear regression with intercept
            X_mat = np.column_stack([np.ones(len(X_train)), X_train])
            beta_hat, *_ = np.linalg.lstsq(X_mat, y_train, rcond=None)
            intercept = beta_hat[0]
            coef = beta_hat[1:]

            # Residual std and mean CAR on training sample
            y_hat_train = X_mat @ beta_hat
            resid = y_train - y_hat_train
            sigma_resid = resid.std(ddof=1)
            mean_car = y_train.mean()

            # Prediction for this event
            car_hat = intercept + np.dot(coef, x_i)
            car_feat = car_hat - mean_car
            score = car_feat / sigma_resid if sigma_resid > 0 else 0.0

            # Decide dollar position
            pos_dollars = score_to_allocation_dollars(score)

            # Macro filter: avoid trading on macro dates (day0)
            if USE_MACRO_FILTER and day0 in macro_dates:
                skipped_macro = True
                pos_dollars = 0.0

            # If we actually take a position, compute PnL using daily prices
            if pos_dollars != 0.0:
                try:
                    px_tkr = prices.loc[ticker]

                    # Entry at day0 open
                    row0 = px_tkr.loc[day0]
                    open0 = float(row0["open"])

                    # Take up to 6 trading days from day0 (day0..day5)
                    px_window = px_tkr.loc[day0:].iloc[:6]

                    # Exit at last available in that 0..5 window
                    row_exit = px_window.iloc[-1]
                    exit_date = row_exit.name
                    close_exit = float(row_exit["adj_close"])

                    raw_ret_0_5 = (close_exit - open0) / open0

                    # Trading costs (open + close)
                    trade_cost = 2.0 * COST_RATE * abs(pos_dollars)

                    # PnL in dollars
                    pnl_dollars = pos_dollars * raw_ret_0_5 - trade_cost

                except KeyError:
                    # Missing price data -> no trade
                    pos_dollars = 0.0
                    pnl_dollars = 0.0
                    raw_ret_0_5 = np.nan
                    exit_date = pd.NaT

        # Save record for this event
        records.append({
            "announce_date": announce_date,
            "day0": day0,
            "exit_date": exit_date,
            "ticker": ticker,
            "CAR_0_5": car_0_5,
            "score": score,
            "position_dollars": pos_dollars,
            "raw_ret_0_5": raw_ret_0_5,
            "pnl_dollars": pnl_dollars,
            "skipped_macro": skipped_macro,
        })

    bt = pd.DataFrame(records)

    # 5) Keep only actual trades
    trades = bt[bt["position_dollars"] != 0].copy()
    trades = trades.sort_values("day0").reset_index(drop=True)

    out_path = "backtest_directional_trades.csv"
    trades.to_csv(out_path, index=False)
    print(f"\nSaved trade details to {out_path}")

    n_trades = len(trades)
    print(f"\nNumber of trades: {n_trades}")
    if n_trades == 0:
        print("No trades taken with current settings.")
        return bt

    # 6) Dollar-level stats
    total_pnl = trades["pnl_dollars"].sum()
    avg_pnl = trades["pnl_dollars"].mean()
    med_pnl = trades["pnl_dollars"].median()
    std_pnl = trades["pnl_dollars"].std(ddof=1)
    hit_rate = (trades["pnl_dollars"] > 0).mean()
    worst = trades["pnl_dollars"].min()
    best = trades["pnl_dollars"].max()

    print(f"Total PnL: ${total_pnl:,.2f}")
    print(f"Average PnL per trade: ${avg_pnl:,.2f}")
    print(f"Median PnL per trade: ${med_pnl:,.2f}")
    print(f"Std dev PnL per trade: ${std_pnl:,.2f}")
    print(f"Hit rate: {hit_rate:.3f}")
    print(f"Worst trade: ${worst:,.2f}")
    print(f"Best trade: ${best:,.2f}")

    # Trades by size
    trades["abs_pos"] = trades["position_dollars"].abs()
    tier_counts = trades["abs_pos"].value_counts().sort_index()

    print("\nTrades by size:")
    for size, count in tier_counts.items():
        print(f"  ${int(size):,}: {count} trades")

    # 7) Returns as % of position
    trades["ret_pct_of_pos"] = trades["pnl_dollars"] / trades["position_dollars"].abs()

    avg_ret_pct = trades["ret_pct_of_pos"].mean()
    med_ret_pct = trades["ret_pct_of_pos"].median()
    std_ret_pct = trades["ret_pct_of_pos"].std(ddof=1)
    worst_ret_pct = trades["ret_pct_of_pos"].min()
    best_ret_pct = trades["ret_pct_of_pos"].max()

    print("\nReturn per trade as % of position:")
    print(f"Average: {100 * avg_ret_pct:.2f}%")
    print(f"Median:  {100 * med_ret_pct:.2f}%")
    print(f"Std dev: {100 * std_ret_pct:.2f}%")
    print(f"Worst:   {100 * worst_ret_pct:.2f}%")
    print(f"Best:    {100 * best_ret_pct:.2f}%")

    print(f"\nHit rate: {hit_rate*100:.1f}% of trades are profitable")

    # 8) Portfolio view on a 10m book
    total_pnl_pct_of_book = total_pnl / ASSUMED_CAPITAL

    start_date = trades["day0"].min()
    end_date = trades["exit_date"].max() if trades["exit_date"].notna().any() else trades["day0"].max()
    years = (end_date - start_date).days / 365.25 if pd.notna(end_date) else 0.0
    trades_per_year = n_trades / years if years > 0 else np.nan

    equity_final = 1.0 + total_pnl_pct_of_book
    ann_return = equity_final ** (1.0 / years) - 1.0 if years > 0 else np.nan

    print(f"\nPortfolio view assuming ${ASSUMED_CAPITAL:,} allocated to this strategy:")
    print(f"Total PnL as % of book: {100 * total_pnl_pct_of_book:.2f}%")
    print(f"Trading period: {start_date.date()} to {end_date.date()} (~{years:.2f} years)")
    print(f"Trades per year: {trades_per_year:.2f}")
    print(f"Approx annualised return on the book: {ann_return*100:.2f}%")

    # 9) Equity curve + simple annualised Sharpe on book
    equity = 1.0
    equity_curve = []
    for _, tr in trades.iterrows():
        equity *= (1.0 + tr["pnl_dollars"] / ASSUMED_CAPITAL)
        equity_curve.append(equity)

    equity_series = pd.Series(equity_curve, index=trades["exit_date"].reset_index(drop=True))
    peak = equity_series.cummax()
    drawdowns = equity_series / peak - 1.0
    max_drawdown = drawdowns.min()

    # Approx annual volatility from per-trade pnl on book
    per_trade_ret_on_book = trades["pnl_dollars"] / ASSUMED_CAPITAL
    std_per_trade = per_trade_ret_on_book.std(ddof=1)
    ann_vol = std_per_trade * np.sqrt(trades_per_year) if trades_per_year > 0 else np.nan
    sharpe_ann = (ann_return / ann_vol) if (ann_vol not in [0, np.nan] and years > 0) else np.nan

    print("\nEquity curve on 10m book:")
    print(f"Final capital (starting from 1.0): {equity_series.iloc[-1]:.6f}")
    print(f"Maximum drawdown: {max_drawdown:.6f}")
    print(f"Approx annual volatility: {ann_vol:.6f}")
    print(f"Approx annual Sharpe: {sharpe_ann:.3f}")

    # 10) Per-tier average % return
    print("\nPer-tier average return (as % of position):")
    for size, count in tier_counts.items():
        sub = trades[trades["abs_pos"] == size]
        avg_pct_tier = (sub["pnl_dollars"] / sub["position_dollars"].abs()).mean()
        print(f"  Size ${int(size):,}: n={count}, avg = {100 * avg_pct_tier:.2f}%")

    return bt


if __name__ == "__main__":
    backtest_directional()
Total events in sample (day0 >= 2010-01-01): 189
Loaded 940 unique macro dates to avoid (day0 only).
C:\Users\dcazo\AppData\Local\Temp\ipykernel_11640\3952785008.py:103: UserWarning: Parsing dates in %d/%m/%Y format when dayfirst=False (the default) was specified. Pass `dayfirst=True` or specify a format to silence this warning.
  macro["date"] = pd.to_datetime(macro["date"]).dt.normalize()
Saved trade details to backtest_directional_trades.csv

Number of trades: 13
Total PnL: $258,254.62
Average PnL per trade: $19,865.74
Median PnL per trade: $14,111.36
Std dev PnL per trade: $16,718.96
Hit rate: 0.846
Worst trade: $-1,500.84
Best trade: $51,908.29

Trades by size:
  $200,000: 8 trades
  $400,000: 2 trades
  $600,000: 3 trades

Return per trade as % of position:
Average: 6.06%
Median:  5.98%
Std dev: 4.68%
Worst:   -0.75%
Best:    15.40%

Hit rate: 84.6% of trades are profitable

Portfolio view assuming $10,000,000 allocated to this strategy:
Total PnL as % of book: 2.58%
Trading period: 2016-11-11 to 2024-02-29 (~7.30 years)
Trades per year: 1.78
Approx annualised return on the book: 0.35%

Equity curve on 10m book:
Final capital (starting from 1.0): 1.026118
Maximum drawdown: -0.000150
Approx annual volatility: 0.002231
Approx annual Sharpe: 1.568

Per-tier average return (as % of position):
  Size $200,000: n=8, avg = 6.26%
  Size $400,000: n=2, avg = 3.49%
  Size $600,000: n=3, avg = 7.24%
In [3]:
import math
import numpy as np
import pandas as pd

# ================== SETTINGS ================== #

EVENT_STUDY_PATH = "event_study_car_0_5.csv"
FEATURES_PATH = "features_model.csv"
MACRO_CALENDAR_PATH = "macro_calendar.csv"

PRICE_FILES = {
    "AAPL": "AAPL.csv",
    "NVDA": "NVDA.csv",
    "GOOGL": "GOOGL.csv",
}

ASSUMED_CAPITAL = 10_000_000
COST_RATE = 0.0005

FEATURE_COLS = ["eps_surprise_pct", "pre_ret_3d"]

TRAIN_START_DATE = pd.Timestamp("2010-01-01")
MIN_TRAIN_EVENTS = 80

TUNING_END_DATE = pd.Timestamp("2018-12-31")
FORWARD_START_DATE = pd.Timestamp("2019-01-01")


# =================================================
# Macro calendar loader
# =================================================

def load_macro_dates(path: str = MACRO_CALENDAR_PATH) -> set:
    try:
        macro = pd.read_csv(path)
    except FileNotFoundError:
        print("Macro calendar file not found – no macro filter applied.")
        return set()

    if "date" not in macro.columns:
        print("Macro calendar has no 'date' column – no macro filter applied.")
        return set()

    # Your macro calendar is dd/mm/YYYY
    macro["date"] = pd.to_datetime(macro["date"], dayfirst=True, errors="coerce").dt.normalize()
    dates = set(macro["date"].dropna().unique())
    print(f"Loaded {len(dates)} unique macro dates to avoid (day0 only).")
    return dates


# =================================================
# Price loader – tailored to your pipeline files
# =================================================

def load_prices() -> pd.DataFrame:
    """
    Load AAPL/NVDA/GOOGL from separate CSVs and standardise to:
    index = [ticker, date]
    cols  = open, high, low, close, adj_close, volume
    """
    all_frames = []

    for sym, path in PRICE_FILES.items():
        df = pd.read_csv(path)

        # --- date handling ---
        if "date" in df.columns:
            df["date"] = pd.to_datetime(df["date"]).dt.normalize()
        elif "Date" in df.columns:
            df.rename(columns={"Date": "date"}, inplace=True)
            df["date"] = pd.to_datetime(df["date"]).dt.normalize()
        else:
            # assume first column is date
            first = df.columns[0]
            df.rename(columns={first: "date"}, inplace=True)
            df["date"] = pd.to_datetime(df["date"]).dt.normalize()

        # --- add ticker if missing ---
        if "ticker" not in df.columns:
            df["ticker"] = sym
        else:
            # normalise ticker in case it's mixed case
            df["ticker"] = df["ticker"].fillna(sym).astype(str)

        # --- unify column names ---
        cols = {c.lower(): c for c in df.columns}

        def rename_if_exists(old, new):
            if old in df.columns:
                df.rename(columns={old: new}, inplace=True)

        # yfinance-style to lower snake
        rename_if_exists("Open", "open")
        rename_if_exists("High", "high")
        rename_if_exists("Low", "low")
        rename_if_exists("Close", "close")
        rename_if_exists("Adj Close", "adj_close")
        rename_if_exists("Adj close", "adj_close")
        rename_if_exists("Adj_Close", "adj_close")

        # ensure essential columns exist; if open/high/low missing, copy close
        if "close" not in df.columns:
            raise ValueError(f"{sym}: no 'close' column found in {path}")

        if "open" not in df.columns:
            df["open"] = df["close"]
        if "high" not in df.columns:
            df["high"] = df["close"]
        if "low" not in df.columns:
            df["low"] = df["close"]

        if "adj_close" not in df.columns:
            df["adj_close"] = df["close"]

        if "volume" not in df.columns:
            df["volume"] = np.nan

        # keep only what we need
        df = df[["ticker", "date", "open", "high", "low", "close", "adj_close", "volume"]].copy()

        # enforce numeric on price columns
        for c in ["open", "high", "low", "close", "adj_close", "volume"]:
            df[c] = pd.to_numeric(df[c], errors="coerce")

        df = df.sort_values("date").reset_index(drop=True)
        all_frames.append(df)

    prices = pd.concat(all_frames, ignore_index=True)
    prices = prices.sort_values(["ticker", "date"])
    prices.set_index(["ticker", "date"], inplace=True)
    return prices


# =================================================
# Score -> dollar position
# (your "lowered" thresholds)
# =================================================

def score_to_allocation_dollars(score: float) -> float:
    s = abs(score)
    if s < 0.30:
        alloc = 0.0
    elif s < 0.40:
        alloc = 200_000.0
    elif s < 0.50:
        alloc = 400_000.0
    else:
        alloc = 600_000.0
    return float(np.sign(score) * alloc)


# =================================================
# Trade summary helper
# =================================================

def summarise_trades(trades: pd.DataFrame, label: str):
    print(f"\n================ {label} ================")

    n_trades = len(trades)
    print(f"Number of trades: {n_trades}")
    if n_trades == 0:
        return

    total_pnl = trades["pnl_dollars"].sum()
    avg_pnl = trades["pnl_dollars"].mean()
    med_pnl = trades["pnl_dollars"].median()
    std_pnl = trades["pnl_dollars"].std(ddof=1)
    hit_rate = (trades["pnl_dollars"] > 0).mean()
    worst = trades["pnl_dollars"].min()
    best = trades["pnl_dollars"].max()

    print(f"Total PnL: ${total_pnl:,.2f}")
    print(f"Average PnL per trade: ${avg_pnl:,.2f}")
    print(f"Median PnL per trade: ${med_pnl:,.2f}")
    print(f"Std dev PnL per trade: ${std_pnl:,.2f}")
    print(f"Hit rate: {hit_rate:.3f}")
    print(f"Worst trade: ${worst:,.2f}")
    print(f"Best trade: ${best:,.2f}")

    trades = trades.copy()
    trades["abs_pos"] = trades["position_dollars"].abs()
    tier_counts = trades["abs_pos"].value_counts().sort_index()

    print("\nTrades by size:")
    for size, count in tier_counts.items():
        print(f"  ${int(size):,}: {count} trades")

    trades["ret_pct_of_pos"] = trades["pnl_dollars"] / trades["position_dollars"].abs()
    avg_ret_pct = trades["ret_pct_of_pos"].mean()
    med_ret_pct = trades["ret_pct_of_pos"].median()
    std_ret_pct = trades["ret_pct_of_pos"].std(ddof=1)
    worst_ret_pct = trades["ret_pct_of_pos"].min()
    best_ret_pct = trades["ret_pct_of_pos"].max()

    print("\nReturn per trade as % of position:")
    print(f"Average: {100 * avg_ret_pct:.2f}%")
    print(f"Median:  {100 * med_ret_pct:.2f}%")
    print(f"Std dev: {100 * std_ret_pct:.2f}%")
    print(f"Worst:   {100 * worst_ret_pct:.2f}%")
    print(f"Best:    {100 * best_ret_pct:.2f}%")

    print(f"\nHit rate: {hit_rate*100:.1f}% of trades are profitable")

    trades = trades.sort_values("day0").reset_index(drop=True)
    start_date = trades["day0"].min()
    end_date = trades["exit_date"].max() if trades["exit_date"].notna().any() else trades["day0"].max()

    if pd.isna(start_date) or pd.isna(end_date):
        years = np.nan
    else:
        years = (end_date - start_date).days / 365.25

    total_pnl_pct_of_book = total_pnl / ASSUMED_CAPITAL
    equity_final = 1.0 + total_pnl_pct_of_book
    if years and years > 0:
        ann_return = equity_final ** (1.0 / years) - 1.0
    else:
        ann_return = np.nan

    # Equity curve & drawdown
    r = trades["pnl_dollars"] / ASSUMED_CAPITAL
    capital = 1.0
    peak = 1.0
    max_drawdown = 0.0
    for rr in r:
        capital *= (1.0 + rr)
        if capital > peak:
            peak = capital
        dd = capital / peak - 1.0
        if dd < max_drawdown:
            max_drawdown = dd

    trades_per_year = n_trades / years if years and years > 0 else np.nan
    ann_vol = r.std(ddof=1) * math.sqrt(trades_per_year) if trades_per_year and trades_per_year > 0 else np.nan
    sharpe = ann_return / ann_vol if ann_vol and ann_vol > 0 else np.nan

    print(f"\nPortfolio view on ${ASSUMED_CAPITAL:,}:")
    print(f"Total PnL as % of book: {100 * total_pnl_pct_of_book:.2f}%")
    if years and years > 0:
        print(f"Trading period: {start_date.date()} to {end_date.date()} (~{years:.2f} years)")
    print(f"Trades per year: {trades_per_year:.2f}" if trades_per_year == trades_per_year else "Trades per year: n/a")
    print(f"Approx annualised return on the book: {ann_return*100:.2f}%")
    print(f"Final capital (starting from 1.0): {capital:.6f}")
    print(f"Maximum drawdown: {max_drawdown:.6f}")
    print(f"Approx annual volatility: {ann_vol:.6f}" if ann_vol == ann_vol else "Approx annual volatility: n/a")
    print(f"Approx annual Sharpe: {sharpe:.3f}" if sharpe == sharpe else "Approx annual Sharpe: n/a")

    print("\nPer-tier average return (as % of position):")
    for size, count in tier_counts.items():
        sub = trades[trades["abs_pos"] == size]
        avg_pct_tier = (sub["pnl_dollars"] / sub["position_dollars"].abs()).mean()
        print(f"  Size ${int(size):,}: n={count}, avg = {100 * avg_pct_tier:.2f}%")


# =================================================
# Main backtest
# =================================================

def backtest_directional_split():
    # ---- load event study ----
    event_df = pd.read_csv(EVENT_STUDY_PATH)

    # normalise date cols if present
    for col in ["announce_date", "day0", "event_start", "event_end", "est_start", "est_end"]:
        if col in event_df.columns:
            event_df[col] = pd.to_datetime(event_df[col]).dt.normalize()

    # choose CAR column
    if "CAR_0_5" in event_df.columns:
        car_col = "CAR_0_5"
    elif "CAR" in event_df.columns:
        car_col = "CAR"
    else:
        raise ValueError("event_study_car_0_5.csv must have CAR_0_5 or CAR column.")

    event_df.rename(columns={car_col: "CAR_USED"}, inplace=True)

    # ---- load features ----
    feat_df = pd.read_csv(FEATURES_PATH)
    for col in ["announce_date", "day0"]:
        if col in feat_df.columns:
            feat_df[col] = pd.to_datetime(feat_df[col]).dt.normalize()

    merge_keys = ["ticker", "announce_date", "day0"]
    df = pd.merge(event_df, feat_df, on=merge_keys, how="inner")

    df = df[df["day0"] >= TRAIN_START_DATE].copy()
    df = df.sort_values("day0").reset_index(drop=True)
    print(f"Total events in sample (day0 >= {TRAIN_START_DATE.date()}): {len(df)}")

    macro_dates = load_macro_dates(MACRO_CALENDAR_PATH)
    prices = load_prices()

    records = []

    for i in range(len(df)):
        row = df.iloc[i]
        ticker = row["ticker"]
        day0 = pd.to_datetime(row["day0"]).normalize()
        announce_date = row["announce_date"]
        car_0_5 = float(row["CAR_USED"])

        x_i = row[FEATURE_COLS].values.astype(float)

        score = np.nan
        pos_dollars = 0.0
        pnl_dollars = 0.0
        raw_ret_0_5 = np.nan
        exit_date = pd.NaT

        # training = all past events
        train = df.iloc[:i].dropna(subset=FEATURE_COLS + ["CAR_USED"])
        if len(train) >= MIN_TRAIN_EVENTS:
            X_train = train[FEATURE_COLS].values.astype(float)
            y_train = train["CAR_USED"].values.astype(float)

            X_mat = np.column_stack([np.ones(len(X_train)), X_train])
            beta_hat, *_ = np.linalg.lstsq(X_mat, y_train, rcond=None)
            intercept = beta_hat[0]
            coef = beta_hat[1:]

            y_hat_train = X_mat @ beta_hat
            resid = y_train - y_hat_train
            sigma_resid = resid.std(ddof=1)
            mean_car = y_train.mean()

            car_hat = intercept + np.dot(coef, x_i)
            car_feat = car_hat - mean_car
            score = car_feat / sigma_resid if sigma_resid > 0 else 0.0

            # macro filter: skip if day0 is macro date
            if day0 not in macro_dates:
                pos_dollars = score_to_allocation_dollars(score)
            else:
                pos_dollars = 0.0

            if pos_dollars != 0.0:
                try:
                    px_tkr = prices.loc[ticker]  # index = date

                    row0 = px_tkr.loc[day0]
                    open0 = float(row0["open"])

                    px_window = px_tkr.loc[day0:]
                    px_window = px_window.iloc[:6]  # up to day0..day5

                    row_exit = px_window.iloc[-1]
                    exit_date = pd.to_datetime(row_exit.name).normalize()
                    close_exit = float(row_exit["adj_close"])

                    raw_ret_0_5 = (close_exit - open0) / open0

                    trade_cost = 2.0 * COST_RATE * abs(pos_dollars)
                    pnl_dollars = pos_dollars * raw_ret_0_5 - trade_cost

                except KeyError:
                    pos_dollars = 0.0
                    pnl_dollars = 0.0
                    raw_ret_0_5 = np.nan
                    exit_date = pd.NaT

        records.append({
            "ticker": ticker,
            "announce_date": announce_date,
            "day0": day0,
            "exit_date": exit_date,
            "CAR_0_5": car_0_5,
            "score": score,
            "position_dollars": pos_dollars,
            "raw_ret_0_5": raw_ret_0_5,
            "pnl_dollars": pnl_dollars,
        })

    bt = pd.DataFrame(records)
    bt.to_csv("backtest_directional_split_all_events.csv", index=False)

    trades = bt[bt["position_dollars"] != 0].copy()
    trades = trades.sort_values("day0").reset_index(drop=True)
    trades.to_csv("backtest_directional_split_trades.csv", index=False)
    print("\nSaved trade details to backtest_directional_split_trades.csv")

    # full period
    summarise_trades(trades, "FULL PERIOD (2010+)")
    # tuning
    tuning_trades = trades[trades["day0"] <= TUNING_END_DATE].copy()
    summarise_trades(tuning_trades, "TUNING PERIOD (2010–2018)")
    # forward
    fwd_trades = trades[trades["day0"] >= FORWARD_START_DATE].copy()
    summarise_trades(fwd_trades, "FORWARD TEST (2019–2024+)")

    return bt, trades


if __name__ == "__main__":
    backtest_directional_split()
Total events in sample (day0 >= 2010-01-01): 189
Loaded 940 unique macro dates to avoid (day0 only).

Saved trade details to backtest_directional_split_trades.csv

================ FULL PERIOD (2010+) ================
Number of trades: 13
Total PnL: $258,254.62
Average PnL per trade: $19,865.74
Median PnL per trade: $14,111.36
Std dev PnL per trade: $16,718.96
Hit rate: 0.846
Worst trade: $-1,500.84
Best trade: $51,908.29

Trades by size:
  $200,000: 8 trades
  $400,000: 2 trades
  $600,000: 3 trades

Return per trade as % of position:
Average: 6.06%
Median:  5.98%
Std dev: 4.68%
Worst:   -0.75%
Best:    15.40%

Hit rate: 84.6% of trades are profitable

Portfolio view on $10,000,000:
Total PnL as % of book: 2.58%
Trading period: 2016-11-11 to 2024-02-29 (~7.30 years)
Trades per year: 1.78
Approx annualised return on the book: 0.35%
Final capital (starting from 1.0): 1.026118
Maximum drawdown: -0.000150
Approx annual volatility: 0.002231
Approx annual Sharpe: 1.568

Per-tier average return (as % of position):
  Size $200,000: n=8, avg = 6.26%
  Size $400,000: n=2, avg = 3.49%
  Size $600,000: n=3, avg = 7.24%

================ TUNING PERIOD (2010–2018) ================
Number of trades: 2
Total PnL: $44,913.63
Average PnL per trade: $22,456.82
Median PnL per trade: $22,456.82
Std dev PnL per trade: $11,802.25
Hit rate: 1.000
Worst trade: $14,111.36
Best trade: $30,802.27

Trades by size:
  $200,000: 2 trades

Return per trade as % of position:
Average: 11.23%
Median:  11.23%
Std dev: 5.90%
Worst:   7.06%
Best:    15.40%

Hit rate: 100.0% of trades are profitable

Portfolio view on $10,000,000:
Total PnL as % of book: 0.45%
Trading period: 2016-11-11 to 2018-11-26 (~2.04 years)
Trades per year: 0.98
Approx annualised return on the book: 0.22%
Final capital (starting from 1.0): 1.004496
Maximum drawdown: 0.000000
Approx annual volatility: 0.001169
Approx annual Sharpe: 1.882

Per-tier average return (as % of position):
  Size $200,000: n=2, avg = 11.23%

================ FORWARD TEST (2019–2024+) ================
Number of trades: 11
Total PnL: $213,340.99
Average PnL per trade: $19,394.64
Median PnL per trade: $13,853.42
Std dev PnL per trade: $17,886.09
Hit rate: 0.818
Worst trade: $-1,500.84
Best trade: $51,908.29

Trades by size:
  $200,000: 6 trades
  $400,000: 2 trades
  $600,000: 3 trades

Return per trade as % of position:
Average: 5.12%
Median:  5.82%
Std dev: 4.06%
Worst:   -0.75%
Best:    12.86%

Hit rate: 81.8% of trades are profitable

Portfolio view on $10,000,000:
Total PnL as % of book: 2.13%
Trading period: 2019-02-15 to 2024-02-29 (~5.04 years)
Trades per year: 2.18
Approx annualised return on the book: 0.42%
Final capital (starting from 1.0): 1.021526
Maximum drawdown: -0.000150
Approx annual volatility: 0.002643
Approx annual Sharpe: 1.589

Per-tier average return (as % of position):
  Size $200,000: n=6, avg = 4.60%
  Size $400,000: n=2, avg = 3.49%
  Size $600,000: n=3, avg = 7.24%
In [ ]: